Abstract

We study the posterior contraction behavior of the latent population structure that
arises in admixture models as the amount of data increases. An admixture model,
alternatively known as a topic model, specifies $k$ populations, each of which is
characterized by a $\Delta^d$-valued vector of frequencies for generating a set of discrete
values in $\{0, 1, \ldots, d\}$. The population polytope is defined as the convex hull of the
$k$ frequency vectors. Under the admixture specification, each of $m$ individuals generates
an i.i.d. frequency vector according to a probability distribution defined on the
(unknown) population polytope $G_0$, and then generates $n$ data points according to the
sampled frequency vector. Rates of posterior contraction are established with respect
to the Hausdorff metric and a minimum-matching Euclidean metric defined on population
polytopes, as the amount of data $m \times n$ tends to infinity. Minimax lower bounds are
also established. Tools developed include posterior asymptotics of hierarchical models
with $m \times n$ data, and arguments from convex geometry.



1 Introduction 

We study a class of hierarchical mixture models for categorical data known as
admixtures, which were independently developed in the landmark papers by Pritchard,
Stephens and Donnelly [Pritchard et al., 2000] and Blei, Ng and Jordan [Blei et al., 2003].
The former set of authors applied their modeling to population genetics, while the latter
considered applications in text processing and computer vision, where their models are more
widely known as the latent Dirichlet allocation model, or a topic model. Admixture modeling
has been applied to and extended in a vast number of fields of engineering and sciences; in
fact, the Google Scholar pages for these two original papers alone combine for more than a
dozen thousand citations. In spite of their wide use, the asymptotic behavior of hierarchical
models such as admixtures remains largely unexplored, to the best of our knowledge.

A finite admixture model posits that there are $k$ populations, each of which is
characterized by a $\Delta^d$-valued vector $\theta_j$ of frequencies for generating a set of
discrete values $\{0, 1, \ldots, d\}$, for $j = 1, \ldots, k$. Here, $\Delta^d$ is the
$d$-dimensional probability simplex. A sampled individual may have mixed ancestry and as a
result inherits some fraction of its values from each of its ancestral populations. Thus, an
individual is associated with a proportion vector
$\beta = (\beta_1, \ldots, \beta_k) \in \Delta^{k-1}$, where $\beta_j$ denotes the proportion of
the individual's data that are generated according to population $j$'s frequency vector
$\theta_j$. This yields a vector of frequencies
$\eta = \sum_{j=1}^{k} \beta_j \theta_j \in \Delta^d$ associated with that individual. In most
applications, one does not observe $\eta$ directly, but rather an i.i.d. sample generated from a
multinomial distribution parameterized by $\eta$. The collection of
$\theta_1, \ldots, \theta_k$ is referred to as the population structure in the admixture. In
population genetics modeling, $\theta_j$ represents the allele frequencies at each locus in an
individual's genome from the $j$-th population. In text document modeling, $\theta_j$
represents the frequencies of words generated by the $j$-th topic, while an individual is a
document, i.e., a collection of words. In computer vision, $\theta_j$ represents the
frequencies of objects generated by the $j$-th scenery topic, while an individual is a natural
image, i.e., a collection of scenery objects. The primary interest is the inference of the
population structure on the basis of sampled data. In a Bayesian estimation setting, the
population structure is assumed random and endowed with a prior distribution; accordingly,
one is interested in the behavior of the posterior distribution of the population structure
given the available data.

The goal of this paper is to obtain contraction rates of the posterior distribution of
the latent population structure that arises in admixture models, as the amount of data
increases. Admixture models present a canonical mixture model for categorical data in
which the population structure provides the support for the mixing measure. Existing
work on the convergence behavior of mixing measures in a mixture model is quite rare,
in either the frequentist or the Bayesian estimation literature. Chen provided the optimal
convergence rate of mixing measures in several finite mixtures for univariate data
[Chen, 1995] (see also [Ishwaran et al., 2001]). Recent progress on multivariate mixture
models includes papers by Rousseau and Mengersen [Rousseau and Mengersen, 2011] and Nguyen
[Nguyen, 2012]. In [Nguyen, 2012], posterior contraction rates of mixing measures in several
finite and infinite mixture models for multivariate and continuous data were obtained.
Toussile and Gassiat established consistency of a penalized MLE procedure for a finite
admixture model [Toussile and Gassiat, 2009]. This issue has also attracted increased
attention in machine learning. Recent papers by Arora et al. [Arora et al., 2012] and
Anandkumar et al. [Anandkumar et al., 2012] study convergence properties of certain
computationally efficient learning algorithms based on matrix factorization techniques.

There are a number of questions that arise in the convergence analysis of admixture
models for categorical data. The first question is to find a suitable metric in order to
establish rates of convergence. It would be ideal to establish convergence for each individual
element $\theta_i$, for $i = 1, \ldots, k$. This is challenging due to the problems of
identifiability. A (relatively) minor issue is known as the "label-switching" problem. That
is, one can identify the collection of $\theta_i$'s only up to a permutation. A deeper problem
is that any $\theta_j$ that can be expressed as a convex combination of the others
$\theta_{j'}$ for $j' \neq j$ may be difficult to identify, estimate, and analyze. To get around
this difficulty, we propose to study the convergence of population structure variables
through their convex hull $G = \mathrm{conv}(\theta_1, \ldots, \theta_k)$, which shall be
referred to as the population polytope. Convergence of convex polytopes can be evaluated in
terms of the Hausdorff metric $d_{\mathcal{H}}$, a metric commonly utilized in convex geometry
[Schneider, 1993]. Moreover, under some geometric identifiability conditions, it can be
shown that convergence in Hausdorff metric entails convergence of all extreme points of
the polytope via a minimum-matching distance metric (defined in Section 2). This is the
theory we aim for in this paper. Convergence behavior of (the posterior of) non-extreme
points among $\theta_1, \ldots, \theta_k$ remains unresolved as of this writing.

The second question in an asymptotic study of a hierarchical model is how to address
multiple quantities that define the amount of empirical data. The admixture model we
consider has two asymptotic quantities that play asymmetric roles: $m$ is the number of
individuals, and $n$ is the number of data points associated with each individual. Both $m$
and $n$ are allowed to increase to infinity. A simple way to think about this asymptotic
setting is to let $m$ go to infinity, while $n := n(m)$ tends to infinity at a certain rate
which may be constrained with respect to $m$. Let $\Pi$ be a prior distribution on variables
$\theta_1, \ldots, \theta_k$. The goal is to derive a vanishing sequence $\delta_{m,n}$,
depending on both $m$ and $n$, such that the posterior distribution of the $\theta_i$'s
satisfies, for some sufficiently large constant $C$,

$$\Pi\bigl(d_{\mathcal{H}}(G, G_0) \ge C\delta_{m,n} \,\big|\, \mathcal{S}^m_{[n]}\bigr) \longrightarrow 0$$

in $P^m_{S_{[n]}|G_0}$-probability as $m \to \infty$ and $n = n(m) \to \infty$ suitably. Here,
$P^m_{S_{[n]}|G_0}$ denotes the true distribution associated with population polytope $G_0$
that generates a given $m \times n$ data set $\mathcal{S}^m_{[n]}$. As mentioned,
$\delta_{m,n}$ is also the posterior contraction rate for the extreme points among population
structure variables $\theta_1, \ldots, \theta_k$.

Overview of results. Suppose that $n \to \infty$ at a rate constrained by $\log m < n$ and
$\log n = o(m)$. In an overfitted setting, i.e., when the true population polytope may have
fewer than $k$ extreme points, we show that under some mild identifiability conditions the
posterior contraction rate in either Hausdorff or minimum-matching distance metric is

$$\delta_{m,n} \asymp \left[\frac{\log m}{m} + \frac{\log n}{n} + \frac{\log n}{m}\right]^{\frac{1}{2(p+\alpha)}},$$

where $p = (k-1) \wedge d$ is the intrinsic dimension of the population polytope while
$\alpha$ denotes the regularity level near the boundary of the support of the density
function for $\eta$. On the other hand, if either the true population polytope is known to
have exactly $k$ extreme points, or if the pairwise distances among the extreme points are
bounded from below by a known positive constant, then the contraction rate is improved to a
parametric rate

$$\delta_{m,n} \asymp \left[\frac{\log m}{m} + \frac{\log n}{n} + \frac{\log n}{m}\right]^{\frac{1}{2(1+\alpha)}}.$$

The constraints on $n = n(m)$ and the appearance of the quantity $\log n/m$ in the
convergence rate are quite interesting. Both the constraints and the derived rate are rooted
in a condition on the required thickness of the prior support of the marginal densities of the
data and an upper bound on the entropy of the space of such densities. This suggests an
interesting interaction between layers in the latent hierarchy of the admixture model worthy
of further investigation. For instance, it is not clear whether posterior consistency continues
to hold if $n$ falls outside of the specified range, and what effects this has on convergence
rates, with or without additional assumptions on the data. This appears quite difficult with
our present set of techniques.

We also establish minimax lower bounds for both settings. In the overfitted setting, the
obtained lower bound is $(mn)^{-1/(q+\alpha)}$, where $q = \lfloor k/2 \rfloor \wedge d$, unless
additional constraints are imposed on the prior. Although this lower bound does not quite
match the posterior contraction rate, the two are qualitatively comparable and both notably
dependent on dimensionality $d$. In particular, if $n \asymp m$ and $k \ge 2d$, the posterior
contraction rate becomes $(\log m/m)^{1/(2(d+\alpha))}$. Compare this to the lower bound
$m^{-2/(d+\alpha)}$, whose exponent differs by only a factor of 4.

Method of proofs and tools. The general framework of posterior asymptotics for density
estimation has been well established [Ghosal et al., 2000, Shen and Wasserman, 2001]
(see also [Barron et al., 1999, Ghosh and Ramamoorthi, 2002, Walker, 2004, Ghosal and van
der Vaart, 2007, Walker et al., 2007]). This framework continues to be very useful, but the
analysis of mixing measure estimation in multi-level models presents distinct new challenges.
In Section 4 we shall formulate an abstract theorem (Theorem 4) on posterior contraction of
latent variables of interest in an admixture model, given $m \times n$ data, by relying on the
framework of [Ghosal et al., 2000] (see also [Nguyen, 2012]). The main novelty here is that
we work on the space of latent variables (e.g., the space of latent population structures
endowed with the Hausdorff or a comparable metric) as opposed to the space of data densities
endowed with the Hellinger metric. A basic quantity is the Hellinger information of the
Hausdorff metric for a given subset of polytopes. Indeed, the Hellinger information is a
fundamental quantity running through the analysis, which ties together the amounts of data
$m$ and $n$, key quantities that are associated with different levels in the model hierarchy.

The bulk of the paper is devoted to establishing properties of the Hellinger information,
which are fed into Theorem 4 so as to obtain concrete convergence rates. This is achieved
through a number of inequalities which illuminate the relationship between the Hausdorff
distance of a given pair of population polytopes $G$, $G'$, and divergence functionals (e.g.,
Kullback-Leibler divergence or total variation distance) of the induced marginal data
densities. The technical challenges lie in the fact that in order to relate $G$ to the marginal
density of the data, one has to integrate out multiple layers of latent variables, $\eta$ and
$\beta$. Techniques in convex geometry prove very useful in the derivation of both lower and
upper bounds [Schneider, 1993].



The remainder of the paper is organized as follows. The model and main results are
described in Section 2. Section 3 describes the basic geometric assumptions and their
consequences. An abstract theorem for posterior contraction for the $m \times n$ data setting
is formulated in Section 4, whose conditions are verified in the subsequent sections. Section
5 proves a contraction result which helps to establish a key lower bound on the Hellinger
information, while Section 6 provides a lower bound on Kullback-Leibler neighborhoods of the
prior support (that is, a bound on the prior thickness). Proofs of main theorems and other
technical lemmas are presented in Section 7 and the Appendices.

Notations. $B_p(\theta, r)$ denotes a closed $p$-dimensional Euclidean ball centered at
$\theta$ with radius $r$. $G^\epsilon$ denotes the Minkowski sum
$G^\epsilon := G + B_{d+1}(0, \epsilon)$. $\mathrm{bd}\, G$, $\mathrm{extr}\, G$,
$\mathrm{Diam}\, G$, $\mathrm{aff}\, G$, $\mathrm{vol}_p\, G$ denote the boundary, the set of
extreme points, the diameter, the affine span, and the $p$-dimensional volume of set $G$,
respectively. "Extreme points" and "vertices" are interchangeable throughout this paper. The
set-theoretic (symmetric) difference between two sets is defined as
$G \triangle G' = (G \setminus G') \cup (G' \setminus G)$.
$N(\epsilon, \mathcal{G}, d_{\mathcal{H}})$ denotes the covering number of $\mathcal{G}$ in the
Hausdorff metric $d_{\mathcal{H}}$. $D(\epsilon, \mathcal{G}, d_{\mathcal{H}})$ is the packing
number of $\mathcal{G}$ in the Hausdorff metric. Several divergence measures for probability
distributions are employed: $K(p, q)$, $h(p, q)$, $V(p, q)$ denote the Kullback-Leibler
divergence, Hellinger distance and total variation distance between two densities $p$ and $q$
defined with respect to a measure on a common space:
$K(p, q) = \int p \log(p/q)$, $h^2(p, q) = \frac{1}{2} \int (\sqrt{p} - \sqrt{q})^2$ and
$V(p, q) = \frac{1}{2} \int |p - q|$. In addition, we define
$K_2(p, q) = \int p [\log(p/q)]^2$. Throughout the paper,
$f(m, n, \epsilon) \lesssim g(m, n, \epsilon)$, equivalently $f = O(g)$, means
$f(m, n, \epsilon) \le C g(m, n, \epsilon)$ for some constant $C$ independent of the asymptotic
quantities $m$, $n$ and $\epsilon$; details about the dependence of $C$ are made explicit
unless obvious from the context. Similarly, $f(m, n, \epsilon) \gtrsim g(m, n, \epsilon)$ or
$f = \Omega(g)$ means $f(m, n, \epsilon) \ge C g(m, n, \epsilon)$.
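As a concrete illustration of the divergence measures just defined, the following minimal
sketch evaluates $K$, $K_2$, $h^2$ and $V$ for discrete densities with respect to counting
measure. The function names and the example vectors `p` and `q` are illustrative assumptions,
not notation from the paper; the code also assumes $q > 0$ wherever $p > 0$.

```python
import math

def kl(p, q):
    """K(p, q) = sum_x p(x) log(p(x)/q(x)); assumes q(x) > 0 wherever p(x) > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def kl2(p, q):
    """K_2(p, q) = sum_x p(x) [log(p(x)/q(x))]^2, under the same assumption."""
    return sum(pi * math.log(pi / qi) ** 2 for pi, qi in zip(p, q) if pi > 0)

def hellinger_sq(p, q):
    """h^2(p, q) = (1/2) sum_x (sqrt(p(x)) - sqrt(q(x)))^2."""
    return 0.5 * sum((math.sqrt(pi) - math.sqrt(qi)) ** 2 for pi, qi in zip(p, q))

def total_variation(p, q):
    """V(p, q) = (1/2) sum_x |p(x) - q(x)|."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))

# Hypothetical example densities on {0, 1}.
p = [0.5, 0.5]
q = [0.9, 0.1]
print(kl(p, q), kl2(p, q), hellinger_sq(p, q), total_variation(p, q))
```

With these normalizations the standard orderings $h^2 \le V$ and $2V^2 \le K$ (Pinsker's
inequality) can be checked numerically on the example.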



2 Main results 



Model description. As mentioned in the introduction, the central objects of the admixture
model are the population structure variables $(\theta_1, \ldots, \theta_k)$, whose convex hull
is called the population polytope: $G = \mathrm{conv}(\theta_1, \ldots, \theta_k)$.
$\theta_1, \ldots, \theta_k$ reside in the $d$-dimensional probability simplex $\Delta^d$.
$k < \infty$ is assumed known. Note that $G$ has at most $k$ vertices (i.e., extreme points)
among $\theta_1, \ldots, \theta_k$.

A random vector $\eta \in G$ is parameterized by
$\eta = \beta_1 \theta_1 + \ldots + \beta_k \theta_k$, where
$\beta = (\beta_1, \ldots, \beta_k) \in \Delta^{k-1}$ is a random vector distributed according
to a distribution $P_{\beta|\gamma}$ for some parameter $\gamma$ (both [Pritchard et al., 2000]
and [Blei et al., 2003] used the Dirichlet distribution). Given $\theta_1, \ldots, \theta_k$,
this induces a probability distribution $P_{\eta|G}$ whose support is the convex set $G$.
Details of this distribution, suppressed for the time being, are given explicitly by Eqs. (14)
and (13). [To be precise, $P_{\eta|G}$ should be written as
$P_{\eta|\theta_1, \ldots, \theta_k; G}$. That is, $G$ is always attached with a specific set
of $\theta_j$'s. Throughout the paper, this specification of $G$ is always understood but
notationally suppressed to avoid cluttering.]

For each individual $i = 1, \ldots, m$, let $\eta^i \in \Delta^d$ be an independent random
vector distributed by $P_{\eta|G}$. The observed data associated with $i$,
$\mathcal{S}^i_{[n]} = (X_{ij})_{j=1}^{n}$, are assumed to be i.i.d. draws from the multinomial
distribution $\mathrm{Mult}(\eta^i)$ specified by
$\eta^i := (\eta_{i0}, \ldots, \eta_{id})$. That is, $X_{ij} \in \{0, \ldots, d\}$ such that
$P(X_{ij} = l \mid \eta^i) = \eta_{il}$ for $l = 0, \ldots, d$.
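The hierarchical specification above can be sketched as a small simulation. This is only an
illustration under stated assumptions: $P_{\beta|\gamma}$ is taken to be a symmetric Dirichlet
distribution with parameter `gamma` (one of the choices mentioned in the text), and the
function names, the example `theta`, and the seed are hypothetical.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw from Dirichlet(alpha) by normalizing independent Gamma(alpha_j, 1) variables."""
    g = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(g)
    return [x / total for x in g]

def sample_admixture(theta, m, n, gamma=1.0, seed=0):
    """Simulate m individuals with n data points each.

    theta: list of k frequency vectors, each of length d + 1 and summing to one.
    Returns a list of m lists, each holding n values in {0, ..., d}.
    """
    rng = random.Random(seed)
    k = len(theta)
    num_values = len(theta[0])                      # d + 1 categories
    data = []
    for _ in range(m):
        beta = sample_dirichlet([gamma] * k, rng)   # beta in Delta^{k-1} (assumed Dirichlet)
        # eta = sum_j beta_j theta_j lies in the population polytope G
        eta = [sum(beta[j] * theta[j][l] for j in range(k)) for l in range(num_values)]
        # n i.i.d. categorical draws with P(X = l | eta) = eta_l
        data.append(rng.choices(range(num_values), weights=eta, k=n))
    return data

# Hypothetical population structure with k = 3 populations and d = 2.
theta = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.2, 0.2, 0.6]]
data = sample_admixture(theta, m=5, n=20)
print(data[0])
```

Only the $m \times n$ array `data` would be observed; `beta` and `eta` are latent.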

Admixture models are simple when specified in a hierarchical manner as given above.
The relevant distributions are written down below. The joint distribution of the generic
random variable $\eta$ and $n$-vector $\mathcal{S}_{[n]}$ (dropping the superscript $i$ used
for indexing a specific individual) is denoted by $P_{\eta \times S_{[n]}|G}$ and its density
by $p_{\eta \times S_{[n]}|G}$. We have

$$p_{\eta \times S_{[n]}|G}(\eta^i, \mathcal{S}^i_{[n]}) = p_{\eta|G}(\eta^i) \times \prod_{j=1}^{n} \prod_{l=0}^{d} \eta_{il}^{I(X_{ij} = l)}. \tag{1}$$

The distribution of $\mathcal{S}_{[n]}$, denoted by $P_{S_{[n]}|G}$, is obtained by integrating
out $\eta$, which yields the following density with respect to counting measure:

$$p_{S_{[n]}|G}(\mathcal{S}^i_{[n]}) = \int_G \prod_{j=1}^{n} \prod_{l=0}^{d} \eta_{il}^{I(X_{ij} = l)} \, dP_{\eta|G}(\eta^i). \tag{2}$$
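The integral in Eq. (2) rarely has a convenient closed form, but it can be approximated by
averaging the multinomial likelihood over draws of $\eta$. The sketch below is a plain Monte
Carlo approximation and not a method used in the paper; it assumes a symmetric Dirichlet
distribution for $\beta$, and all names and example values are hypothetical.

```python
import itertools
import random

def sample_eta(theta, rng, gamma=1.0):
    """Draw eta = sum_j beta_j theta_j with beta ~ symmetric Dirichlet(gamma) (an assumption)."""
    k = len(theta)
    g = [rng.gammavariate(gamma, 1.0) for _ in range(k)]
    total = sum(g)
    beta = [x / total for x in g]
    return [sum(beta[j] * theta[j][l] for j in range(k)) for l in range(len(theta[0]))]

def marginal_density(seq, etas):
    """Monte Carlo estimate of Eq. (2): the average over eta of prod_j eta[x_j]."""
    total = 0.0
    for eta in etas:
        prob = 1.0
        for x in seq:
            prob *= eta[x]
        total += prob
    return total / len(etas)

rng = random.Random(0)
theta = [[0.8, 0.2], [0.3, 0.7]]          # k = 2 populations, d = 1
etas = [sample_eta(theta, rng) for _ in range(2000)]
n = 3
densities = {seq: marginal_density(seq, etas)
             for seq in itertools.product(range(2), repeat=n)}
total_mass = sum(densities.values())
print(total_mass)
```

Because each sampled `eta` is a probability vector, the estimated density sums to one over
all $(d+1)^n$ sequences up to floating-point error, and it is invariant to reordering the
entries of a sequence, as Eq. (2) requires.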

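The joint distribution of the full data set
$\mathcal{S}^m_{[n]} := (\mathcal{S}^i_{[n]})_{i=1}^{m}$, denoted by $P^m_{S_{[n]}|G}$, is a
product distribution:

$$P^m_{S_{[n]}|G}(\mathcal{S}^m_{[n]}) := \prod_{i=1}^{m} P_{S_{[n]}|G}(\mathcal{S}^i_{[n]}). \tag{3}$$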



Admixture models are customarily introduced in an equivalent way as follows [Blei et al.,
2003, Pritchard et al., 2000]: For each $i = 1, \ldots, m$, draw an independent random
variable $\beta \in \Delta^{k-1}$ as $\beta \sim P_{\beta|\gamma}$. Given $i$ and $\beta$, for
$j = 1, \ldots, n$, draw $Z_{ij} \mid \beta \sim \mathrm{Mult}(\beta)$. $Z_{ij}$ takes values
in $\{1, \ldots, k\}$. Now, data point $X_{ij}$ is randomly generated by
$X_{ij} \mid Z_{ij} = l, \theta \sim \mathrm{Mult}(\theta_l)$. This yields the same joint
distribution of $\mathcal{S}^i_{[n]} = (X_{ij})_{j=1}^{n}$ as the one described earlier. The use
of latent variables $Z_{ij}$ is amenable to the development of computational algorithms for
inference. However, this representation bears no significance within the scope of this work.
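The equivalence of the two specifications comes down to marginalizing the latent label:
$P(X = l \mid \beta, \theta) = \sum_{z} \beta_z \theta_{z,l} = \eta_l$. The following small
check makes that step explicit for one hypothetical choice of `beta` and `theta` (both are
illustrative values, not taken from the paper).

```python
# Hypothetical proportion vector and population frequency vectors (k = 3, d = 2).
beta = [0.5, 0.3, 0.2]
theta = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.2, 0.2, 0.6]]
k = len(beta)
num_values = len(theta[0])

# Latent-variable route: joint probability of (Z = z, X = l), then sum out z.
joint = {(z, l): beta[z] * theta[z][l] for z in range(k) for l in range(num_values)}
marginal_x = [sum(joint[(z, l)] for z in range(k)) for l in range(num_values)]

# Direct route: eta is the convex combination of the theta_j with weights beta_j.
eta = [0.0] * num_values
for z in range(k):
    for l in range(num_values):
        eta[l] += beta[z] * theta[z][l]

print(marginal_x, eta)
```

Both routes give the same categorical distribution for a single data point, which is why the
labels $Z_{ij}$ can be dropped when only the law of the observations matters.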






Asymptotic setting and metrics on population polytopes. Assume the data set
$\mathcal{S}^m_{[n]} = (\mathcal{S}^i_{[n]})_{i=1}^{m}$ of size $m \times n$ is generated
according to an admixture model given by "true" parameters
$\theta^*_1, \ldots, \theta^*_k$. $G_0 = \mathrm{conv}(\theta^*_1, \ldots, \theta^*_k)$ is the
true population polytope. Under the Bayesian estimation framework, the population structure
variables $(\theta_1, \ldots, \theta_k)$ are random and endowed with a prior distribution
$\Pi$. The main question to be addressed in this paper is the contraction behavior of the
posterior distribution $\Pi(G \mid \mathcal{S}^m_{[n]})$, as the number of data points
$m \times n$ goes to infinity.

Note that we do not always assume that the number of extreme points of the population
polytope $G_0$ is $k$. We work in a general overfitted setting where $k$ only serves as
the upper bound of the true number of extreme points for the purpose of model
parameterization. The special case in which the number of extreme points of $G_0$ is known
a priori is also interesting and will be considered.

Let $\mathrm{extr}\, G$ denote the set of extreme points of a given polytope $G$.
$\mathcal{G}^k$ is the set of population polytopes in $\Delta^d$ such that
$|\mathrm{extr}\, G| \le k$. Let $\mathcal{G}^* = \cup_{2 \le k < \infty} \mathcal{G}^k$ be the
set of population polytopes that have a finite number of extreme points in $\Delta^d$. A
natural metric on $\mathcal{G}^*$ is the following "minimum-matching" Euclidean distance:

$$d_{\mathcal{M}}(G, G') = \max_{\theta \in \mathrm{extr}\, G} \min_{\theta' \in \mathrm{extr}\, G'} \|\theta - \theta'\| \;\vee\; \max_{\theta' \in \mathrm{extr}\, G'} \min_{\theta \in \mathrm{extr}\, G} \|\theta - \theta'\|.$$

A more common metric is the Hausdorff metric:

$$d_{\mathcal{H}}(G, G') = \min\{\epsilon \ge 0 \mid G \subset G'^{\epsilon};\ G' \subset G^{\epsilon}\} = \max_{\theta \in G} d(\theta, G') \;\vee\; \max_{\theta' \in G'} d(\theta', G).$$

Here, $G^\epsilon = G + B_{d+1}(0, \epsilon) := \{\theta + e \mid \theta \in G, e \in \mathbb{R}^{d+1}, \|e\| \le \epsilon\}$,
and $d(\theta, G') := \inf\{\|\theta - \theta'\|, \theta' \in G'\}$. Observe that
$d_{\mathcal{H}}$ depends on the boundary structure of sets, while $d_{\mathcal{M}}$ depends
only on extreme points. In general, $d_{\mathcal{M}}$ dominates $d_{\mathcal{H}}$, but under
additional mild assumptions the two metrics are equivalent (see Lemma 1).
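To make the two metrics concrete, the sketch below computes both in the simplest case
$k = 2$, where each population polytope is a line segment. It uses the fact that the distance
to a convex set is a convex function, so the maxima in the Hausdorff metric are attained at
extreme points. The helper names and the two example segments are hypothetical; handling
general polytopes would require a projection onto a convex hull, which is not attempted here.

```python
import math

def min_matching_distance(ext_g, ext_h):
    """d_M: each extreme point is matched to its nearest extreme point in the other set."""
    a = max(min(math.dist(t, s) for s in ext_h) for t in ext_g)
    b = max(min(math.dist(t, s) for t in ext_g) for s in ext_h)
    return max(a, b)

def dist_point_segment(x, a, b):
    """Euclidean distance from point x to the segment conv(a, b)."""
    ab = [bi - ai for ai, bi in zip(a, b)]
    ax = [xi - ai for ai, xi in zip(a, x)]
    denom = sum(c * c for c in ab)
    t = 0.0 if denom == 0 else max(0.0, min(1.0, sum(u * v for u, v in zip(ax, ab)) / denom))
    proj = [ai + t * c for ai, c in zip(a, ab)]
    return math.dist(x, proj)

def hausdorff_segments(seg_g, seg_h):
    """d_H for two segments: it suffices to check the endpoints of each segment."""
    a = max(dist_point_segment(x, seg_h[0], seg_h[1]) for x in seg_g)
    b = max(dist_point_segment(x, seg_g[0], seg_g[1]) for x in seg_h)
    return max(a, b)

# Two hypothetical population polytopes in the simplex Delta^2 (k = 2, d = 2).
G = [(0.6, 0.2, 0.2), (0.2, 0.6, 0.2)]
G_prime = [(0.5, 0.3, 0.2), (0.2, 0.5, 0.3)]
d_m = min_matching_distance(G, G_prime)
d_h = hausdorff_segments(G, G_prime)
print(d_m, d_h)
```

On any such pair the computed values satisfy $d_{\mathcal{H}} \le d_{\mathcal{M}}$, in line
with the statement that $d_{\mathcal{M}}$ dominates $d_{\mathcal{H}}$.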

We introduce a notion of regularity for a family of probability distributions defined on
convex polytopes $G \in \mathcal{G}^*$. This notion is concerned with the behavior near the
boundary of the support of distributions $P_{\eta|G}$. We say a family of distributions
$\{P_{\eta|G} \mid G \in \mathcal{G}^k\}$ is $\alpha$-regular if for any $G \in \mathcal{G}^k$
and any $\eta_0 \in \mathrm{bd}\, G$,

$$P_{\eta|G}(\|\eta - \eta_0\| \le \epsilon) \ge c\, \epsilon^{\alpha}\, \mathrm{vol}_p(G \cap B_{d+1}(\eta_0, \epsilon)),$$

where $p$ is the number of dimensions of the affine space $\mathrm{aff}\, G$ that spans $G$,
and the constant $c > 0$ is independent of $G$, $\eta_0$ and $\epsilon$.

Assumptions. $\Pi$ is a prior distribution on $\theta_1, \ldots, \theta_k$ such that the
following hold for the relevant parameters that reside in the support of $\Pi$:

(S0) Geometric properties (A1) and (A2) listed in Section 3 are satisfied uniformly for all
$G$.

(S1) Each of $\theta_1, \ldots, \theta_k$ is bounded away from the boundary of $\Delta^d$.
That is, if $\theta_j = (\theta_{j,0}, \ldots, \theta_{j,d})$ then
$\min_{l = 0, \ldots, d} \theta_{j,l} > c$ for all $j = 1, \ldots, k$.

(S2) For any small $\epsilon$,
$\Pi(\|\theta_j - \theta^*_j\| \le \epsilon \ \forall j = 1, \ldots, k) \ge c' \epsilon^{kd}$,
for some $c' > 0$.

(S3) $\beta = (\beta_1, \ldots, \beta_k)$ is distributed (a priori) according to a symmetric
probability distribution $P_\beta$ on $\Delta^{k-1}$. That is, the random variables
$\beta_1, \ldots, \beta_k$ are exchangeable.

(S4) $P_\beta$ induces a family of distributions $\{P_{\eta|G} \mid G \in \mathcal{G}^k\}$ that
is $\alpha$-regular.

Theorem 1. Let $G_0 \in \mathcal{G}^k$ be in the support of prior $\Pi$. Let
$p = (k-1) \wedge d$. Under Assumptions (S0)-(S4) of the admixture model, as $m \to \infty$
and $n \to \infty$ such that $\log\log m < \log n = o(m)$, for some sufficiently large
constant $C$ independent of $m$ and $n$,

$$\Pi\bigl(d_{\mathcal{M}}(G_0, G) \ge C\delta_{m,n} \,\big|\, \mathcal{S}^m_{[n]}\bigr) \longrightarrow 0 \tag{4}$$

in $P^m_{S_{[n]}|G_0}$-probability. Here,

$$\delta_{m,n} = \left[\frac{\log m}{m} + \frac{\log n}{n} + \frac{\log n}{m}\right]^{\frac{1}{2(p+\alpha)}}.$$

The same statement holds for the Hausdorff metric $d_{\mathcal{H}}$.
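The rate in Theorem 1 is easy to evaluate numerically, which helps to see how the three terms
interact. The sketch below simply codes the displayed formula for $\delta_{m,n}$ with
$p = (k-1) \wedge d$; the function name and the sample values of $m$, $n$, $k$, $d$ and
$\alpha$ are illustrative assumptions, and unknown multiplicative constants are ignored.

```python
import math

def contraction_rate(m, n, k, d, alpha):
    """delta_{m,n} = [log m / m + log n / n + log n / m]^{1 / (2 (p + alpha))}, p = min(k-1, d)."""
    p = min(k - 1, d)
    base = math.log(m) / m + math.log(n) / n + math.log(n) / m
    return base ** (1.0 / (2 * (p + alpha)))

# Balanced growth of m and n (hypothetical values; k = 3, d = 5, alpha = 1).
for size in (10**3, 10**4, 10**6):
    print(size, contraction_rate(size, size, k=3, d=5, alpha=1.0))

# Many individuals but few data points per individual: the log n / n term dominates.
print(contraction_rate(10**9, 30, k=3, d=5, alpha=1.0))
```

In this example the rate with $m = 10^9$ and $n = 30$ is worse than the rate with
$m = n = 10^4$, even though the former has far more data in total.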

Remarks. (i) Geometric assumption (S0) and its consequences are presented in the next
section. (S0), (S1) and (S2) are mild assumptions observed in practice (cf. [Blei et al.,
2003, Pritchard et al., 2000]). (S4) is a standard assumption that holds for a range of
$\alpha$ when $P_{\beta|\gamma}$ is a Dirichlet distribution (see Lemma 4), but there may be
other choices. The assumption in (S3) that $P_\beta$ is symmetric is relatively strong, but
again it has been widely adopted (e.g., symmetric Dirichlet distributions, including the
uniform distribution). It may be difficult to relax this assumption if one insists on using
the Hausdorff metric; see the remark following the statement of Lemma 7.

(ii) In practice $P_\beta$ may be further parameterized as $P_{\beta|\gamma}$, where
$\gamma$ is endowed with a prior distribution. Then, it would be of interest to also study the
posterior contraction behavior for $\gamma$. In this paper we have opted to focus only on the
convergence behavior of the population structure to simplify the exposition and the results.

(iii) The appearance of both $m^{-1}$ and $n^{-1}$ in the contraction rate suggests that if
either $m$ or $n$ is small, the rate would suffer even if the total amount of data
$m \times n$ increases. What is quite interesting is the appearance of $\log n/m$. This is
rooted in a condition that the thickness of the prior support in Kullback-Leibler
neighborhood for the marginal density $p_{S_{[n]}|G}$ is appropriately bounded from below. See
Theorem 6 for such a lower bound. This in turn arises from an upper bound on the
Kullback-Leibler distance of marginal densities, which increases with $n$ (see Lemma 7). From
a hierarchical modeling viewpoint, this result highlights an interesting interaction of sample
sizes provided to different levels in the model hierarchy. This issue has not been widely
discussed in the hierarchical modeling literature in a theoretical manner, to the best of our
knowledge.

(iv) Note that the constraints $n > \log m$ and $\log n = o(m)$ are required in order to
obtain rates of posterior contraction. These constraints are related to the term $\log n/m$
mentioned above: they stem from the upper bound on the Kullback-Leibler distance in Lemma 7.
The remark following the statement of this lemma explains why the upper bound almost always
grows with $n$. A very special situation is presented in Lemma 5, where an upper bound
on the Kullback-Leibler distance can be obtained that is independent of $n$. However, such a
situation cannot be verified in any reasonable estimation setting. This suggests that with
our proof technique, we almost always require $n$ to grow at a constrained rate relative to
$m$ in order to obtain posterior contraction rates.

(v) Since the quantity $\log n/m$ arises partly from a fairly general proof technique in
Bayesian asymptotics, one may wonder whether it is possible to get rid of it by considering
a point estimation procedure for $G$. A line of reasoning goes like this. For each
$i = 1, \ldots, m$, as $n \to \infty$ one can estimate $\eta^i$ arbitrarily well at a rate
$O(n^{-1/2})$. All that remains is to estimate polytope $G$ as if the (exact) samples from
$P_{\eta|G}$ were available, an estimation task that incurs a rate depending on $m$,
independent of $n$. Since one does not have exact samples from $P_{\eta|G}$, a more formal
reasoning of similar spirit is needed: Observe that all information we have access to
regarding polytope $G$ is through the marginal density $p_{S_{[n]}|G}$ for $m$ samples of the
$n$-vector $\mathcal{S}_{[n]}$. By Theorem 5, if
$h(p_{S_{[n]}|G}, p_{S_{[n]}|G_0}) \to 0$ at a rate, say $\epsilon_m$, as $m \to \infty$ and
$n \to \infty$ suitably, then $d_{\mathcal{H}}(G, G_0)$ vanishes at the rate
$(\epsilon_m + (\log n/n)^{1/2})^{1/\gamma}$ for some constant $\gamma > 0$. If we can somehow
show that a point estimate for $p_{S_{[n]}|G_0}$ (e.g., the maximum likelihood estimator) can
yield a parametric rate in Hellinger metric, say $\epsilon_m \asymp (\log m/m)^{1/2}$, then the
convergence rate for $G$ in $d_{\mathcal{H}}$ would be
$(\log m/m + \log n/n)^{1/(2\gamma)}$, which is happily free of $\log n/m$. It is not so obvious
if this is possible, as sample size $n$ should nonetheless affect the complexity of the random
vectors $\mathcal{S}_{[n]}$. Formally, the entropy number of the space of marginal densities
$p_{S_{[n]}|G}$ generally scales with $O(\log n)$. A standard derivation (see, e.g., [van de
Geer, 2000]) would still yield the rate $\epsilon_m \asymp (\log n/m)^{1/2}$. Our conclusion
is that without strong assumptions such as the one discussed in remark (iv), removing
quantities such as $\log n/m$ from the convergence rate is far from trivial.

(vi) The exponent $\frac{1}{2(p+\alpha)}$ suggests a slow, nonparametric-like convergence
rate. Moreover, later in Theorem 3 we show that this is qualitatively quite close to a minimax
lower bound. On the other hand, the following theorem shows that it is possible to achieve a
parametric rate if additional constraints are imposed on the true $G_0$ and/or the prior
$\Pi$:

Theorem 2. Let $G_0 \in \mathcal{G}^k$ be in the support of prior $\Pi$. Assume (S0)-(S4),
and that either one of the following two conditions holds:

(a) $|\mathrm{extr}\, G_0| = k$, or

(b) There is a known constant $r_0 > 0$ such that the pairwise distances of the extreme
points of all $G$ in the support of the prior are bounded from below by $r_0$.

Then, as $m \to \infty$ and $n \to \infty$ such that $\log m < n$ and $\log n = o(m)$, Eq. (4)
holds with

$$\delta_{m,n} = \left[\frac{\log m}{m} + \frac{\log n}{n} + \frac{\log n}{m}\right]^{\frac{1}{2(1+\alpha)}}.$$

The same statement holds for the Hausdorff metric $d_{\mathcal{H}}$.
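The only difference between the rates of Theorems 1 and 2 is the dimension appearing in the
exponent: $p = (k-1) \wedge d$ in the overfitted setting versus $1$ under condition (a) or
(b). The sketch below compares the two for one hypothetical configuration; the function name
and the values of $m$, $n$, $k$, $d$ and $\alpha$ are assumptions chosen for illustration, and
constants are ignored.

```python
import math

def rate(m, n, dim, alpha):
    """[log m / m + log n / n + log n / m]^{1 / (2 (dim + alpha))}."""
    base = math.log(m) / m + math.log(n) / n + math.log(n) / m
    return base ** (1.0 / (2 * (dim + alpha)))

m = n = 10**4
k, d, alpha = 5, 10, 1.0
p = min(k - 1, d)                      # intrinsic dimension used in Theorem 1

overfitted = rate(m, n, p, alpha)      # Theorem 1 exponent 1 / (2 (p + alpha))
parametric = rate(m, n, 1, alpha)      # Theorem 2 exponent 1 / (2 (1 + alpha))
print(overfitted, parametric)
```

Whenever the bracketed quantity is below one and $p > 1$, the Theorem 2 rate is the smaller
of the two, and the gap widens as $p$ grows.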

The next theorem produces minimax lower bounds that are qualitatively quite close to
the nonparametric-like rates obtained in Theorem 1. In the following theorem, $\eta$ is not
parameterized by $\beta$ and the $\theta_j$'s as in the admixture model. Instead, we shall
simply replace assumptions (S3) and (S4) on $P_{\beta|\gamma}$ by either one of the following
assumptions on $P_{\eta|G}$:

(S5) For any pair of $p$-dimensional polytopes $G' \subset G$ that satisfy Property A1,

$$V(P_{\eta|G}, P_{\eta|G'}) \lesssim d_{\mathcal{H}}(G, G')^{\alpha}\, \mathrm{vol}_p\, G \setminus G'.$$

(S5') For any $p$-dimensional polytope $G$, $P_{\eta|G}$ is the uniform distribution on $G$.
(This actually entails (S5) for $\alpha = 0$.)

Since a parameterization for $\eta$ is not needed, the overall model can be simplified as
follows: Given population polytope $G \subset \Delta^d$, for each $i = 1, \ldots, m$, draw
$\eta^i \sim P_{\eta|G}$. For each $j = 1, \ldots, n$, draw
$\mathcal{S}^i_{[n]} = (X_{ij})_{j=1}^{n} \sim \mathrm{Mult}(\eta^i)$.

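Theorem 3. Suppose that $G_0 \in \mathcal{G}^k$ satisfies assumptions (S0), (S1) and (S2).
Point estimates $\hat{G} = \hat{G}(\mathcal{S}^m_{[n]})$ are restricted to those that also
satisfy (S0), (S1) and (S2). In the following, infimum and supremum are taken over the
specified domains for $\hat{G}$ and $G_0$, while the multiplying constants in $\gtrsim$ depend
only on constants specified by these assumptions.

(a) Let $q = \lfloor k/2 \rfloor \wedge d$. Under Assumption (S5), we have

$$\inf_{\hat{G}} \sup_{G_0} P^m_{S_{[n]}|G_0}\, d_{\mathcal{H}}(G_0, \hat{G}) \gtrsim \left(\frac{1}{mn}\right)^{\frac{1}{q+\alpha}}.$$

(b) Let $q = \lfloor k/2 \rfloor \wedge d$. Under Assumption (S5'), we have

$$\inf_{\hat{G}} \sup_{G_0} P^m_{S_{[n]}|G_0}\, d_{\mathcal{H}}(G_0, \hat{G}) \gtrsim \left(\frac{1}{m}\right)^{\frac{1}{q}}.$$

(c) Assume (S5), and that either condition (a) or (b) of Theorem 2 holds. Then

$$\inf_{\hat{G}} \sup_{G_0} P^m_{S_{[n]}|G_0}\, d_{\mathcal{H}}(G_0, \hat{G}) \gtrsim \left(\frac{1}{mn}\right)^{\frac{1}{1+\alpha}}.$$

Furthermore, if (S5) is replaced by (S5'), the lower bound becomes $1/m$.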



Remarks. (i) Although there remains some gap between the posterior contraction rate in
Theorem 1 and the minimax lower bound in Theorem 3 (a), they are qualitatively comparable
and both notably dependent on $d$ and $k$. The gap should be expected partly because
slightly enlarged models are considered in Theorem 3 due to the relaxed parameterization.
Nonetheless, if $k \ge 2d$, and allowing $m \asymp n$, the rate exponents differ by only a
factor of 4. That is, $m^{-1/(2(d+\alpha))}$ vis-a-vis $m^{-2/(d+\alpha)}$.

(ii) The nonparametric-like lower bounds in parts (a) and (b) in the overfitted setting
are somewhat surprising even if $P_\beta$ is known exactly (e.g., $P_\beta$ is the uniform
distribution). Since we are more likely to be in the overfitted setting than to know the exact
number of extreme points, an implication of this is that it is important in practice to impose
a lower bound on the pairwise distances between the extreme points of the population polytope.

(iii) The results in parts (b) and (c) under assumption (S5') present an interesting scenario
in which the obtained lower bounds do not depend on $n$, which determines the amount of
data at the bottom level in the model hierarchy.
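The "factor of 4" in remark (i) can be checked with a few lines of arithmetic. The sketch
below computes the exponent of $m$ in the posterior contraction rate of Theorem 1 and in the
lower bound of Theorem 3 (a) when $m \asymp n$, ignoring logarithmic factors and constants.
The function names and the example values of $k$, $d$ and $\alpha$ are hypothetical.

```python
def posterior_exponent(k, d, alpha):
    """Exponent e such that the Theorem 1 rate is roughly m^{-e}: e = 1 / (2 (p + alpha))."""
    p = min(k - 1, d)
    return 1.0 / (2 * (p + alpha))

def lower_bound_exponent(k, d, alpha):
    """With n of the same order as m, (mn)^{-1/(q + alpha)} behaves like m^{-2/(q + alpha)}."""
    q = min(k // 2, d)
    return 2.0 / (q + alpha)

k, d, alpha = 8, 4, 1.0                 # an example with k >= 2d, so p = q = d
ratio = lower_bound_exponent(k, d, alpha) / posterior_exponent(k, d, alpha)
print(posterior_exponent(k, d, alpha), lower_bound_exponent(k, d, alpha), ratio)
```

When $k \ge 2d$ both $p$ and $q$ equal $d$, so the ratio of the two exponents is
$\frac{2/(d+\alpha)}{1/(2(d+\alpha))} = 4$ regardless of $d$ and $\alpha$.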

3 Geometric assumptions and basic lemmas



In this section we discuss the geometric assumptions postulated in the main theorems, and
describe their consequences using elementary arguments in convex geometry of Euclidean
spaces. These results relate the Hausdorff metric, the minimum-matching metric, and the
volume of the set-theoretic difference of polytopes. These relationships prove crucial in
obtaining explicit posterior contraction rates. Here, we state the properties and prove the
results for $p$-dimensional polytopes and convex bodies of points in $\Delta^d$, for a given
$p \le d$. (Convex bodies are bounded convex sets that may have an unbounded number of
extreme points. Within this section, the detail of the ambient space is irrelevant. For
instance, $\Delta^d$ may be replaced by $\mathbb{R}^{d+1}$ or a higher-dimensional Euclidean
space.)

Property A1. (Property of thick body): For some $r, R > 0$ and $\theta_c \in \Delta^d$, $G$
contains the spherical ball $B_p(\theta_c, r)$ and is contained in $B_p(\theta_c, R)$.

Property A2. (Property of non-obtuse corners): For some small $\delta > 0$, at each vertex
of $G$ there is a supporting hyperplane whose angle formed with any edge adjacent to that
vertex is bounded from below by $\delta$.

We state key geometric lemmas that will be used throughout the paper. Bounds such as
those given by Lemma 2 are probably well known in the folklore of convex geometry (for
instance, part (b) of that lemma is similar to, but not precisely the same as, Lemma 2.3.6
from [Schneider, 1993]). Due to the absence of direct references we include the proof of
this and other lemmas in the Appendix.
Lemma 1. (a) $d_{\mathcal{H}}(G, G') \le d_{\mathcal{M}}(G, G')$.

(b) If the two polytopes $G$, $G'$ satisfy Property A2, then
$d_{\mathcal{M}}(G, G') \le C_0\, d_{\mathcal{H}}(G, G')$, for some positive constant
$C_0 > 0$ depending only on $\delta$.

According to part (b) of this lemma, convergence of a sequence of convex polytopes
$G \in \mathcal{G}^k$ to $G_0 \in \mathcal{G}^k$ in Hausdorff metric entails the convergence of
the extreme points of $G$ to those of $G_0$. Moreover, they share the same rate as the
Hausdorff convergence.

Lemma 2. There are positive constants $C_1$ and $c_1$ depending only on $r$, $R$, $p$ such
that for any two $p$-dimensional convex bodies $G$, $G'$ satisfying Property A1:

(a) $\mathrm{vol}_p\, G \triangle G' \ge c_1\, d_{\mathcal{H}}(G, G')^p$.

(b) $\mathrm{vol}_p\, G \triangle G' \le C_1\, d_{\mathcal{H}}(G, G')$.


Remark. The exponents in both bounds in Lemma[2]are attainable. Indeed, for the lower 
bound in part (a), consider a fixed convex polytope G. For each vertex 6{ G G, consider 
point x that lie on edges incident to 0j such that \\x — 0i\\ = e. Let G' be the convex 
hull of all such x's and the remaining vertices of G. Clearly, d-u(G,G') = 0(e), and 
volp G \ G' < 0(e p ). Thus, for the collection of convex polytopes G' constructed in this 
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way, vol p (G A G') X du(G, G') p . The upper bound in part (b) is also tight for a broad 
class of convex polytopes, as exemplified by the following lemma. 
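
The corner-chopping construction in the remark can be checked in closed form for a simple planar case ($p = 2$). The sketch below uses the unit square as a stand-in for the fixed polytope; this choice and the function name are illustrative assumptions, not part of the paper.

```python
import math

def chopped_square(eps):
    """Unit square with every corner cut at distance eps along its two edges.

    Returns (hausdorff_distance, removed_area), both in closed form.
    Valid for 0 < eps < 0.5 so that the cuts do not overlap.
    """
    # Each removed piece is a right isosceles triangle with legs of length eps.
    removed_area = 4 * (eps ** 2) / 2
    # The removed point farthest from the new polytope is the old corner, at
    # distance eps / sqrt(2) from the new edge joining the two cut points.
    hausdorff_distance = eps / math.sqrt(2)
    return hausdorff_distance, removed_area

for eps in (0.1, 0.05, 0.025):
    d_h, vol = chopped_square(eps)
    # vol / d_h**2 stays (numerically) constant, i.e. vol scales like d_H^p with p = 2.
    print(eps, d_h, vol / d_h ** 2)
```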

Lemma 3. Let $G$ be a fixed polytope with $|\mathrm{extr}\, G| = k < \infty$, and let $G'$ be an arbitrary polytope in
$\mathcal{G}^*$. Moreover, suppose that either one of the following conditions holds:

(a) $|\mathrm{extr}\, G'| = k$, or

(b) The pairwise distances between the extreme points of $G'$ are bounded from below by a
constant $r_0 > 0$.

Then, there is a positive constant $\epsilon_0 = \epsilon_0(G)$ depending only on $G$, and a positive constant
$c_2 = c_2(G)$ in case (a) and $c_2 = c_2(G, r_0)$ in case (b), such that

$$\mathrm{vol}_p\, G \triangle G' \ge c_2\, d_H(G, G')$$

as soon as $d_H(G, G') \le \epsilon_0(G)$.

Remark. We note that the bound obtained in this lemma is substantially stronger than the
bound obtained by Lemma 2 part (a). This is due to the asymmetric roles of $G$, which is
held fixed, and $G'$, which can vary. As a result, the constant $c_2$ as stated in the present lemma is
independent of $G'$ but allowed to depend on $G$. By contrast, the constant $c_1$ in Lemma 2
part (a) is independent of both $G$ and $G'$.

4 An abstract posterior contraction theorem 

In this section we state an abstract posterior contraction theorem for hierarchical models,
whose proof is given in the Appendix. The setting of this theorem is a general hierarchical
model defined as follows:

$$G \sim \Pi, \qquad \eta_1, \ldots, \eta_m \mid G \sim P_{\eta|G},$$
$$\mathcal{S}^i_{[n]} \mid \eta_i \sim P_{\mathcal{S}_{[n]}|\eta_i} \quad \text{for } i = 1, \ldots, m.$$

The detail of the conditional distributions in the above specification is actually irrelevant. Thus
the results in this section may be of general interest for hierarchical models with $m \times n$ data.
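
For readers who prefer a generative view, the following minimal sketch simulates $m \times n$ data from one concrete instance of this hierarchy: the admixture model with a Dirichlet-distributed weight vector. The Dirichlet choice, the fixed polytope vertices, and the function names are assumptions made for illustration; the abstract theorem does not depend on them.

```python
import random

def sample_dirichlet(gamma, rng):
    """Draw a Dirichlet(gamma) vector by normalising independent Gamma draws."""
    draws = [rng.gammavariate(g, 1.0) for g in gamma]
    total = sum(draws)
    return [x / total for x in draws]

def sample_admixture(thetas, gamma, m, n, seed=0):
    """Simulate m individuals with n categorical observations each.

    thetas: k frequency vectors (extreme points of G), each of length d + 1.
    gamma:  Dirichlet parameters for the mixing weights beta.
    """
    rng = random.Random(seed)
    categories = list(range(len(thetas[0])))
    data = []
    for _ in range(m):
        beta = sample_dirichlet(gamma, rng)
        # eta = sum_j beta_j * theta_j is a point of G = conv(theta_1, ..., theta_k).
        eta = [sum(b * th[i] for b, th in zip(beta, thetas)) for i in categories]
        data.append(rng.choices(categories, weights=eta, k=n))
    return data

thetas = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.2, 0.2, 0.6]]
data = sample_admixture(thetas, gamma=[0.5, 0.5, 0.5], m=4, n=10, seed=1)
print(data)
```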

As before, $p_{\mathcal{S}_{[n]}|G}$ is the marginal density of the generic $\mathcal{S}_{[n]}$, which is obtained by integrating
out the generic random vector $\eta$ (e.g., see Eq. ©). We need several key notions. Define the
Hausdorff ball as:

$$B_{d_H}(G_1, \delta) := \{G \subset \Delta^d : d_H(G_1, G) \le \delta\}.$$

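A useful quantity for proving posterior concentration theorems is the Hellinger information
of Hausdorff metric for a given set:

Definition 1. Fix $G_0 \in \mathcal{G}^k$. Let $\mathcal{G}$ be a subset of $\mathcal{G}^k$. For a fixed $n$, the sample size of $\mathcal{S}_{[n]}$, define
the Hellinger information of $d_H$ metric for set $\mathcal{G}$ as a real-valued function on the positive
reals $\Psi_{\mathcal{G},n} : \mathbb{R}_+ \to \mathbb{R}$:

$$\Psi_{\mathcal{G},n}(\delta) := \inf_{G \in \mathcal{G};\; d_H(G_0, G) \ge \delta/2} h^2(p_{\mathcal{S}_{[n]}|G_0}, p_{\mathcal{S}_{[n]}|G}).$$
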
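We also define $\Phi_{\mathcal{G},n} : \mathbb{R}_+ \to \mathbb{R}$ to be an arbitrary non-negative valued function on the
positive reals such that for any $\delta > 0$,

$$\sup_{G, G' \in \mathcal{G};\; d_H(G, G') \le \Phi_{\mathcal{G},n}(\delta)} h^2(p_{\mathcal{S}_{[n]}|G}, p_{\mathcal{S}_{[n]}|G'}) \le \Psi_{\mathcal{G},n}(\delta)/4.$$

In both definitions of $\Phi$ and $\Psi$, we suppress the dependence on (the fixed) $G_0$ and $\delta$ to
simplify notation. Note that if $G_0 \in \mathcal{G}$, it follows from the definition that $\Phi_{\mathcal{G},n}(\delta) < \delta/2$.

Remark. Suppose that the conditions of Lemma 7(b) hold, so that

$$h^2(p_{\mathcal{S}_{[n]}|G}, p_{\mathcal{S}_{[n]}|G'}) \le K(p_{\mathcal{S}_{[n]}|G}, p_{\mathcal{S}_{[n]}|G'}) \le \frac{n}{c_0} C_0\, d_H(G, G').$$

Then it suffices to choose $\Phi_{\mathcal{G},n}(\delta) = \frac{c_0}{4nC_0} \Psi_{\mathcal{G},n}(\delta)$.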

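Define the neighborhood of the prior support around $G_0$ in terms of Kullback-Leibler
distance of the marginal densities $p_{\mathcal{S}_{[n]}|G}$:

$$B_K(G_0, \delta) = \{G \in \mathcal{G}^* \mid K(p_{\mathcal{S}_{[n]}|G_0}, p_{\mathcal{S}_{[n]}|G}) \le \delta^2;\; K_2(p_{\mathcal{S}_{[n]}|G_0}, p_{\mathcal{S}_{[n]}|G}) \le \delta^2\}. \tag{6}$$
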
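Theorem 4. Suppose that

(a) $m \to \infty$ and $n \to \infty$ at a certain rate relative to $m$,

(b) There is a sequence of subsets $\mathcal{G}_m \subset \mathcal{G}^*$, a large constant $C$, and a sequence of scalars
$\epsilon_{m,n}$ defined in terms of $m$ and $n$ such that $m\epsilon_{m,n}^2$ tends to infinity, such that

$$\sup_{G_1 \in \mathcal{G}_m} \log D(\Phi_{\mathcal{G}_m,n}(\epsilon), \mathcal{G}_m \cap B_{d_H}(G_1, \epsilon/2), d_H) + \log D(\epsilon/2, \mathcal{G}_m \cap B_{d_H}(G_0, 2\epsilon) \setminus B_{d_H}(G_0, \epsilon), d_H) \le m\epsilon_{m,n}^2, \tag{7}$$

$$\Pi(\mathcal{G}^* \setminus \mathcal{G}_m) \le \exp[-m\epsilon_{m,n}^2 (C + 4)], \tag{8}$$

$$\Pi(B_K(G_0, \epsilon_{m,n})) \ge \exp[-m\epsilon_{m,n}^2 C]. \tag{9}$$

(c) There is a sequence of positive scalars $M_m$ such that

$$\Psi_{\mathcal{G}_m,n}(M_m \epsilon_{m,n}) \ge 8\epsilon_{m,n}^2 (C + 4), \tag{10}$$

$$\exp(2m\epsilon_{m,n}^2) \sum_{j \ge M_m} \exp[-m\Psi_{\mathcal{G}_m,n}(j\epsilon_{m,n})/8] \to 0. \tag{11}$$

Then, $\Pi(G : d_H(G_0, G) \ge M_m \epsilon_{m,n} \mid \mathcal{S}^{[m]}_{[n]}) \to 0$ in $P^m_{\mathcal{S}_{[n]}|G_0}$-probability as $m$ and
$n \to \infty$.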

The proof of this theorem is deferred to the Appendix. As noted above, this result is
applicable to any hierarchical model for $m \times n$ data. The choice of Hausdorff metric $d_H$ is
arbitrary here, and can be replaced by any other valid metric (e.g., $d_M$). The remainder of
the paper is devoted to verifying the conditions of this theorem so it can be applied. These
conditions hinge on our having established a lower bound for the Hellinger information
function $\Psi_{\mathcal{G}_m,n}(\cdot)$ (via Theorem 5), and a lower bound for the prior probability defined on
Kullback-Leibler balls $B_K(G_0, \cdot)$ (via Theorem 6). Both types of results are obtained by
utilizing the convex geometry lemmas described in the previous section.



5 Contraction properties 

The following contraction result guarantees that as the marginal densities of $\mathcal{S}_{[n]}$ get closer in
total variation distance (or Hellinger metric), so do the corresponding population
polytopes in Hausdorff metric (or minimum matching metric). This gives a lower bound
for the Hellinger information defined by Eq. ©, because $h$ is related to $V$ via the inequality
$h \ge V$.

Theorem 5. (a) Let $G, G'$ be two convex bodies in $\Delta^d$. $G$ is a $p$-dimensional body
containing the spherical ball $B_p(\theta_c, r)$, while $G'$ is a $p'$-dimensional body containing $B_{p'}(\theta_c, r)$ for
some $p, p' \le d$, $r > 0$, $\theta_c \in \Delta^d$. In addition, assume that both $P_{\eta|G}$ and $P_{\eta|G'}$ are $\alpha$-regular
densities on $G$ and $G'$, respectively. Then, there is $c_1 > 0$ independent of $G, G'$ such that

$$c_1\, d_H(G, G')^{(p \vee p') + \alpha} \le V(p_{\mathcal{S}_{[n]}|G}, p_{\mathcal{S}_{[n]}|G'}) + 6(d+1) \exp\Big[-\frac{n}{8(d+1)}\, d_H(G, G')^2\Big].$$

(b) Assume further that $G$ is a fixed convex polytope, $G'$ an arbitrary polytope, $p' = p$,
and that either $|\mathrm{extr}\, G'| = |\mathrm{extr}\, G|$ or the pairwise distances of extreme points of $G'$ are
bounded from below by a constant $r_0 > 0$. Then, there are constants $c_2, C_3 > 0$ depending
only on $G$ and $r_0$ (and independent of $G'$) such that

$$c_2\, d_H(G, G')^{1 + \alpha} \le V(p_{\mathcal{S}_{[n]}|G}, p_{\mathcal{S}_{[n]}|G'}) + 6(d+1) \exp\Big[-\frac{n}{C_3(d+1)}\, d_H(G, G')^2\Big].$$


Remark. Part (a) holds for varying pairs of $G, G'$ satisfying certain conditions. It is a
consequence of Lemma 2(a). Part (b) produces a tighter bound, but it holds only for a fixed
$G$, while $G'$ is allowed to vary while satisfying certain conditions. This is a consequence of
Lemma 3. Constants $c_1, c_2$ are the same as those from Lemma 2(a) and Lemma 3, respectively.

Proof. (a) The main idea of the proof is the construction of a suitable test set in order to
distinguish $p_{\mathcal{S}_{[n]}|G'}$ from $p_{\mathcal{S}_{[n]}|G}$. The proof is organized as a sequence of steps.



Step 1 Given a data vector $\mathcal{S}_{[n]} = (X_1, \ldots, X_n)$, define $\hat\eta(\mathcal{S}) \in \Delta^d$ such that the $i$-th
element of $\hat\eta(\mathcal{S})$ is $\frac{1}{n} \sum_{j=1}^n \mathbb{I}(X_j = i)$ for each $i = 0, \ldots, d$. In the following we simply
use $\hat\eta$ to ease the notation. By the definition of the variational distance,

$$V(p_{\mathcal{S}_{[n]}|G}, p_{\mathcal{S}_{[n]}|G'}) = \sup_A |P_{\mathcal{S}_{[n]}|G}(\hat\eta \in A) - P_{\mathcal{S}_{[n]}|G'}(\hat\eta \in A)|, \tag{12}$$

where the supremum is taken over all measurable subsets of $\Delta^d$.

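Step 2 Fix a constant $\epsilon > 0$. By Hoeffding's inequality and the union bound, under the
conditional distribution $P_{\mathcal{S}_{[n]}|\eta}$,

$$P_{\mathcal{S}_{[n]}|\eta}\Big(\max_{i=0,\ldots,d} |\hat\eta_i - \eta_i| \ge \epsilon\Big) \le 2(d+1) \exp(-2n\epsilon^2)$$

with probability one (as $\eta$ is random). It follows that

$$P_{\eta \times \mathcal{S}_{[n]}|G}(\|\hat\eta - \eta\| \ge \epsilon) \le P_{\eta \times \mathcal{S}_{[n]}|G}\Big(\max_{i=0,\ldots,d} |\hat\eta_i - \eta_i| \ge \epsilon (d+1)^{-1/2}\Big) \le 2(d+1) \exp[-2n\epsilon^2/(d+1)].$$

The same bound holds under $P_{\eta \times \mathcal{S}_{[n]}|G'}$.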
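
Steps 1 and 2 can be made concrete with a small sketch: the first function builds the empirical frequency vector, and the other two evaluate the right-hand sides of the two tail bounds. The function names are hypothetical; the formulas are the ones displayed above.

```python
import math

def empirical_frequency(sample, d):
    """Empirical frequency vector of a sample taking values in {0, ..., d}."""
    counts = [0] * (d + 1)
    for x in sample:
        counts[x] += 1
    return [c / len(sample) for c in counts]

def max_coordinate_bound(n, d, eps):
    """Hoeffding plus union bound on P(max_i |hat_eta_i - eta_i| >= eps)."""
    return 2 * (d + 1) * math.exp(-2 * n * eps ** 2)

def euclidean_bound(n, d, eps):
    """Resulting bound on P(||hat_eta - eta|| >= eps), via eps / sqrt(d + 1)."""
    return 2 * (d + 1) * math.exp(-2 * n * eps ** 2 / (d + 1))

print(empirical_frequency([0, 2, 2, 1], d=2))
print(max_coordinate_bound(100, 2, 0.3), euclidean_bound(100, 2, 0.3))
```

The Euclidean-norm bound is the weaker of the two because the deviation budget is split evenly across the $d + 1$ coordinates.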

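Step 3 Define the event $B = \{\|\hat\eta - \eta\| < \epsilon\}$. Take any (measurable) set $A \subset \Delta^d$,

$$|P_{\mathcal{S}_{[n]}|G}(\hat\eta \in A) - P_{\mathcal{S}_{[n]}|G'}(\hat\eta \in A)|$$
$$= |P_{\eta \times \mathcal{S}_{[n]}|G}(\hat\eta \in A; B) + P_{\eta \times \mathcal{S}_{[n]}|G}(\hat\eta \in A; B^c) - P_{\eta \times \mathcal{S}_{[n]}|G'}(\hat\eta \in A; B) - P_{\eta \times \mathcal{S}_{[n]}|G'}(\hat\eta \in A; B^c)|$$
$$\ge |P_{\eta \times \mathcal{S}_{[n]}|G}(\hat\eta \in A; B) - P_{\eta \times \mathcal{S}_{[n]}|G'}(\hat\eta \in A; B)| - 4(d+1) \exp[-2n\epsilon^2/(d+1)]. \tag{13}$$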

Step 4 Let $\epsilon_1 = d_H(G, G')/4$. For any $\epsilon \le \epsilon_1$, recall the outer $\epsilon$-parallel set $G_\epsilon =
(G + B_{d+1}(0, \epsilon))$, which is full-dimensional $(d+1)$ even though $G$ may not be. By the triangle
inequality, $d_H(G_\epsilon, G'_\epsilon) \ge d_H(G, G')/2$. We shall argue that for any $\epsilon \le \epsilon_1$, there is a
constant $c_1 > 0$ independent of $G, G', \epsilon$ and $\epsilon_1$ such that either one of the two scenarios
holds:

(i) There is a set $A^* \subset G \setminus G'$ such that $A^*_\epsilon \cap G'_\epsilon = \emptyset$ and $\mathrm{vol}_p(A^*) \ge c_1 \epsilon_1^p$, or

(ii) There is a set $A^* \subset G' \setminus G$ such that $A^*_\epsilon \cap G_\epsilon = \emptyset$ and $\mathrm{vol}_{p'}(A^*) \ge c_1 \epsilon_1^{p'}$.

Indeed, since $\epsilon \le d_H(G, G')/4$, either one of the following two inequalities holds:
$d_H(G \setminus G'_{3\epsilon}, G') \ge d_H(G, G')/4$ or $d_H(G' \setminus G_{3\epsilon}, G) \ge d_H(G, G')/4$. If the former
inequality holds, let $A^* = G \setminus G'_{3\epsilon}$. Then, $A^* \subset G \setminus G'$ and $A^*_\epsilon \cap G'_\epsilon = \emptyset$. Moreover, by
Lemma 2(a), $\mathrm{vol}_p(A^*) \ge c_1 \epsilon_1^p$, for some constant $c_1 > 0$ independent of $\epsilon, \epsilon_1, G, G'$, so
$A^*$ satisfies (i). In fact, using the same argument as in the proof of Lemma 2(a) there is a
point $x \in \mathrm{bd}\, G$ such that $G' \cap B_p(x, \epsilon_1) = \emptyset$. Combined with the $\alpha$-regularity of $P_{\eta|G}$, we
have $P_{\eta|G}(A^*) \ge \epsilon_1^\alpha\, \mathrm{vol}_p(G \cap B_p(x, \epsilon_1)) \ge c_1 \epsilon_1^{p+\alpha}$ for some constant $c_1 > 0$. If the latter
inequality holds, the same argument applies by defining $A^* = G' \setminus G_{3\epsilon}$ so that (ii) holds.

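Step 5 Suppose that (i) holds for the chosen $A^*$. This means that
$P_{\eta \times \mathcal{S}_{[n]}|G'}(\hat\eta \in A^*_\epsilon; B) \le P_{\eta|G'}(\eta \in A^*_{2\epsilon}) = 0$, since $A^*_{2\epsilon} \cap G' = \emptyset$, which is a
consequence of $A^*_\epsilon \cap G'_\epsilon = \emptyset$. In addition,

$$P_{\mathcal{S}_{[n]}|G}(\hat\eta \in A^*_\epsilon) \ge P_{\eta \times \mathcal{S}_{[n]}|G}(\hat\eta \in A^*_\epsilon; B)$$
$$\ge P_{\eta|G}(A^*) - P_{\eta \times \mathcal{S}_{[n]}|G}(B^c)$$
$$\ge P_{\eta|G}(A^*) - 2(d+1) \exp(-2n\epsilon^2/(d+1))$$
$$\ge c_1 \epsilon_1^{p+\alpha} - 2(d+1) \exp(-2n\epsilon^2/(d+1)).$$

Hence, by Eq. (13), $|P_{\mathcal{S}_{[n]}|G}(\hat\eta \in A^*_\epsilon) - P_{\mathcal{S}_{[n]}|G'}(\hat\eta \in A^*_\epsilon)| \ge c_1 \epsilon_1^{p+\alpha} - 6(d+1) \exp(-2n\epsilon^2/(d+1))$.
Set $\epsilon = \epsilon_1$; the conclusion then follows by invoking Eq. (12). The scenario of (ii) proceeds
in the same way.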

(b) Under the condition that the pairwise distances of extreme points of $G'$ are bounded
from below by $r_0 > 0$, the proof is very similar to part (a), by invoking Lemma 3. Under the
condition that $|\mathrm{extr}\, G'| = k$, the proof is also similar, but it requires a suitable modification
for the existence of the set $A^*$. For any small $\epsilon$, let $\tilde{G}_\epsilon$ be the minimum-volume homothetic
transformation of $G$, with respect to center $\theta_c$, such that $\tilde{G}_\epsilon$ contains $G_\epsilon$. Since $B_p(\theta_c, r) \subset
G \subset B_p(\theta_c, R)$ for $R = 1$, it is simple to see that $d_H(G, \tilde{G}_\epsilon) \le \epsilon R/r = \epsilon/r$.

Set $\epsilon_1 = d_H(G, G') r/4$. We shall argue that for any $\epsilon \le \epsilon_1$, there is a constant $c_2 > 0$
independent of $G'$, $\epsilon$ and $\epsilon_1$ such that either one of the following two scenarios holds:

(iii) There is a set $A^* \subset G \setminus G'$ such that $A^*_\epsilon \cap G'_\epsilon = \emptyset$ and $\mathrm{vol}_p(A^*) \ge c_2 \epsilon_1$, or

(iv) There is a set $A^* \subset G' \setminus G$ such that $A^*_\epsilon \cap G_\epsilon = \emptyset$ and $\mathrm{vol}_p(A^*) \ge c_2 \epsilon_1$.

Indeed, note that either one of the following two inequalities holds: $d_H(G \setminus \tilde{G}'_{3\epsilon}, G') \ge
d_H(G, G')/4$ or $d_H(G' \setminus \tilde{G}_{3\epsilon}, G) \ge d_H(G, G')/4$. If the former inequality holds, let $A^* =
G \setminus \tilde{G}'_{3\epsilon}$. Then, $A^* \subset G \setminus G'$ and $A^*_\epsilon \cap G'_\epsilon = \emptyset$. Observe that both $G$ and $\tilde{G}'_{3\epsilon}$ have the same
number of extreme points by the construction. Moreover, $G$ is fixed so that all geometric
properties A2, A1 are satisfied for both $G$ and $\tilde{G}'_{3\epsilon}$ for sufficiently small $d_H(G, G')$. By
Lemma 3, $\mathrm{vol}_p(A^*) \ge c_2 \epsilon_1$. Hence, (iii) holds. If the latter inequality holds, the same
argument applies by defining $A^* = G' \setminus \tilde{G}_{3\epsilon}$ so that (iv) holds.

Now the proof of the theorem proceeds in the same manner as in part (a).

□

6 Concentration properties of the prior support

In this section we study properties of the support of the prior probabilities as specified by
the admixture model, including bounds for the support of the prior as defined by
Kullback-Leibler neighborhoods.



$\alpha$-regularity. Let $\beta$ be a random variable taking values in $\Delta^{k-1}$ that has a density $p_\beta$
(with respect to the $(k-1)$-dimensional Hausdorff measure on $\mathbb{R}^k$). Define the random variable
$\eta = \beta_1\theta_1 + \ldots + \beta_k\theta_k$, which takes values in $G = \mathrm{conv}(\theta_1, \ldots, \theta_k)$. Write $\eta = L\beta$,
where $L = [\theta_1 \ldots \theta_k]$ is a $(d+1) \times k$ matrix. If $k \le d+1$, $\theta_1, \ldots, \theta_k$ are generally linearly
independent, in which case matrix $L$ has rank $k-1$. By the change of variable formula
[Evans and Gariepy, 1992] (Chapter 3), $P_\beta$ induces a distribution $P_{\eta|G}$ on $G \subset \Delta^d$, which
admits the following density with respect to the $(k-1)$-dimensional Hausdorff measure on
$\Delta^d$:

$$p_\eta(\eta|G) = p_\beta(L^{-1}(\eta))\, J(L)^{-1}. \tag{14}$$

Here $J(L)$ denotes the Jacobian of the linear map. On the other hand, if $k > d+1$, then
$L$ is generally $d$-ranked. The induced distribution for $\eta$ admits the following density with
respect to the $d$-dimensional Hausdorff measure:


$$p_\eta(\eta|G) = \int_{L^{-1}\{\eta\}} p_\beta(\beta)\, J(L)^{-1}\, \mathcal{H}^{k-(d+1)}(d\beta). \tag{15}$$


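A common choice for $P_\beta$ is the Dirichlet distribution, as adopted by [Pritchard et al.,
2000, Blei et al., 2003]: given parameter $\gamma \in \mathbb{R}^k_+$, for any $A \subset \Delta^{k-1}$,

$$P_\beta(\beta \in A) = \int_A \frac{\Gamma(\sum_{j} \gamma_j)}{\prod_{j=1}^k \Gamma(\gamma_j)} \prod_{j=1}^k \beta_j^{\gamma_j - 1}\, d\beta_1 \ldots d\beta_{k-1}.$$
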

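Lemma 4. Let $\eta = \sum_{j=1}^k \beta_j\theta_j$, where $\beta$ is distributed according to a $(k-1)$-dimensional
Dirichlet distribution with parameters $\gamma_j \in (0, 1]$ for $j = 1, \ldots, k$.

(a) If $k \le d+1$, there is a constant $\epsilon_0 = \epsilon_0(k) > 0$, and a constant $c_6 = c_6(\gamma, k, d) > 0$
dependent on $\gamma$, $k$ and $d$ such that for any $\epsilon < \epsilon_0$,

$$\inf_{G \subset \Delta^d}\, \inf_{\eta^* \in G} P_{\eta|G}(\|\eta - \eta^*\| \le \epsilon) \ge c_6\, \epsilon^{k-1}.$$

(b) If $k > d+1$, the statement holds with a lower bound $c_6\, \epsilon^{d + \sum_{i=1}^k \gamma_i}$.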

A consequence of this lemma is that if $\gamma_j \le 1$ for all $j = 1, \ldots, k$, $k \le d+1$ and
$G$ is $(k-1)$-dimensional, then the induced $P_{\eta|G}$ has a Hausdorff density that is bounded
away from 0 on its entire support $\Delta^{k-1}$, which implies 0-regularity. On the other hand,
if $\gamma_j \le 1$ for all $j$, $k > d+1$, and $G$ is $d$-dimensional, then $P_{\eta|G}$ is at least $\sum_{j=1}^k \gamma_j$-regular.
Note that the $\alpha$-regularity condition is concerned with the density behavior near
the boundary of its support, and thus is weaker than what is guaranteed here.

Bounds on KL divergences. Suppose that the population polytope $G$ is endowed with a
prior distribution on $\mathcal{G}^k$ (via a prior on the population structures $\theta_1, \ldots, \theta_k$). Given $G$, the
marginal density $p_{\mathcal{S}_{[n]}|G}$ of the $n$-vector $\mathcal{S}_{[n]}$ is obtained via Eq. Q. To establish the
concentration properties of the Kullback-Leibler neighborhood $B_K$ as induced by the prior, we need
to obtain an upper bound on the KL divergences for the marginal densities in terms of
Hausdorff metric on population polytopes. First, consider a very special case:

Lemma 5. Let $G, G' \subset \Delta^d$ be closed convex sets satisfying property A1. Moreover, assume
that

(a) $G \subset G'$, $\mathrm{aff}\, G = \mathrm{aff}\, G'$ is $p$-dimensional, for $p \le d$.

(b) $P_{\eta|G}$ (resp. $P_{\eta|G'}$) are uniform distributions on $G$ (resp. $G'$).

Then, there is a constant $C_1 = C_1(r, p) > 0$ such that $K(p_{\mathcal{S}_{[n]}|G}, p_{\mathcal{S}_{[n]}|G'}) \le C_1\, d_H(G, G')$.

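Proof. First, we note a well-known fact of KL divergences: the divergence between marginal
distributions (e.g., on $\mathcal{S}_{[n]}$) is bounded from above by the divergence between joint
distributions (e.g., on $\eta$ and $\mathcal{S}_{[n]}$ via Eq. (1)):

$$K(p_{\mathcal{S}_{[n]}|G}, p_{\mathcal{S}_{[n]}|G'}) \le K(P_{\eta \times \mathcal{S}_{[n]}|G}, P_{\eta \times \mathcal{S}_{[n]}|G'}).$$

Due to the hierarchical specification, $p_{\eta \times \mathcal{S}_{[n]}|G} = p_{\eta|G} \times p_{\mathcal{S}_{[n]}|\eta}$ and $p_{\eta \times \mathcal{S}_{[n]}|G'} = p_{\eta|G'} \times
p_{\mathcal{S}_{[n]}|\eta}$, so $K(P_{\eta \times \mathcal{S}_{[n]}|G}, P_{\eta \times \mathcal{S}_{[n]}|G'}) = K(p_{\eta|G}, p_{\eta|G'})$. The assumption $\mathrm{aff}\, G = \mathrm{aff}\, G'$
and moreover $G \subset G'$ implies that $K(p_{\eta|G}, p_{\eta|G'}) < \infty$. In addition, $P_{\eta|G}$ and $P_{\eta|G'}$ are
assumed to be uniform distributions on $G$ and $G'$, respectively, so

$$K(p_{\eta|G}, p_{\eta|G'}) = \int \log \frac{\mathrm{vol}_p\, G'}{\mathrm{vol}_p\, G}\, dP_{\eta|G}.$$

By Lemma 2(b), $\log[\mathrm{vol}_p\, G' / \mathrm{vol}_p\, G] \le \log(1 + C_1 d_H(G, G')) \le C_1 d_H(G, G')$ for some
constant $C_1 = C_1(r, p) > 0$. This completes the proof. □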
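
A one-dimensional instance makes the bound in Lemma 5 tangible. The sketch below takes $G = [a, b]$ and $G' = [a - \epsilon, b + \epsilon]$, so that $d_H(G, G') = \epsilon$, and compares the KL divergence between the two uniform distributions with the linear bound $2\epsilon/(b - a)$. The interval setting and the function name are illustrative assumptions.

```python
import math

def kl_uniform_nested(a, b, eps):
    """K(Unif[a, b], Unif[a - eps, b + eps]) = log(vol G' / vol G)."""
    return math.log((b - a + 2 * eps) / (b - a))

a, b = 0.2, 0.8
for eps in (0.1, 0.01, 0.001):
    kl = kl_uniform_nested(a, b, eps)
    linear_bound = 2 * eps / (b - a)  # from log(1 + x) <= x
    print(eps, kl, linear_bound)
```

Note that the divergence does not depend on the sample size $n$, in line with the remark that follows.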



Remark. The previous lemma requires a particularly stringent condition, $\mathrm{aff}\, G = \mathrm{aff}\, G'$,
and moreover $G \subset G'$, which is usually violated when $k < d+1$. However, the conclusion
is worth noting in that the upper bound does not depend on the sample size $n$ (for $\mathcal{S}_{[n]}$). The
next lemma removes this condition and the condition that both $p_{\eta|G}$ and $p_{\eta|G'}$ be uniform.
As a result the upper bound obtained is weaker, in the sense that the bound is not in terms
of a Hausdorff distance, but in terms of a Wasserstein distance.

Let $Q(\eta_1, \eta_2)$ denote a coupling of $P(\eta|G)$ and $P(\eta|G')$, i.e., a joint distribution
on $G \times G'$ whose induced marginal distributions of $\eta_1$ and $\eta_2$ are equal to $P(\eta|G)$ and
$P(\eta|G')$, respectively. Let $\mathcal{Q}$ be the set of all such couplings. The Wasserstein distance
between $p_{\eta|G}$ and $p_{\eta|G'}$ is defined as

$$W_1(p_{\eta|G}, p_{\eta|G'}) = \inf_{Q \in \mathcal{Q}} \int \|\eta_1 - \eta_2\|\, dQ(\eta_1, \eta_2).$$

Lemma 6. Let $G, G' \subset \Delta^d$ be closed convex subsets such that any $\eta = (\eta_0, \ldots, \eta_d) \in
G \cup G'$ satisfies $\min_{l=0,\ldots,d} \eta_l > c_0$ for some constant $c_0 > 0$. Then

$$K(p_{\mathcal{S}_{[n]}|G}, p_{\mathcal{S}_{[n]}|G'}) \le \frac{n}{c_0} W_1(p_{\eta|G}, p_{\eta|G'}).$$

Remark. As $n \to \infty$, the upper bound tends to infinity. This is expected, because the
marginal distribution $P_{\mathcal{S}_{[n]}|G}$ should degenerate. Since typically $\mathrm{aff}\, G \ne \mathrm{aff}\, G'$,
Kullback-Leibler distances between $p_{\mathcal{S}_{[n]}|G}$ and $p_{\mathcal{S}_{[n]}|G'}$ should typically tend to infinity.

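Proof. Associate each sample $\mathcal{S}_{[n]} = (X_1, \ldots, X_n)$ with a $(d+1)$-dimensional vector
$\eta(\mathcal{S}) \in \Delta^d$, where $\eta(\mathcal{S})_i = \frac{1}{n} \sum_{j=1}^n \mathbb{I}(X_j = i)$ for each $i = 0, \ldots, d$. The density of $\mathcal{S}_{[n]}$
given $G$ (with respect to the counting measure) takes the form:

$$p_{\mathcal{S}_{[n]}|G}(\mathcal{S}_{[n]}) = \int_G p(\mathcal{S}_{[n]}|\eta)\, dP(\eta|G) = \int_G \exp\Big(n \sum_{i=0}^d \eta(\mathcal{S})_i \log \eta_i\Big)\, dP(\eta|G).$$

Due to the convexity of Kullback-Leibler divergence, by Jensen's inequality, for any
coupling $Q \in \mathcal{Q}$:

$$K(p_{\mathcal{S}_{[n]}|G}, p_{\mathcal{S}_{[n]}|G'}) = K\Big(\int p(\mathcal{S}_{[n]}|\eta_1)\, dQ(\eta_1, \eta_2), \int p(\mathcal{S}_{[n]}|\eta_2)\, dQ(\eta_1, \eta_2)\Big)$$
$$\le \int K(p(\mathcal{S}_{[n]}|\eta_1), p(\mathcal{S}_{[n]}|\eta_2))\, dQ(\eta_1, \eta_2).$$

It follows that $K(p_{\mathcal{S}_{[n]}|G}, p_{\mathcal{S}_{[n]}|G'}) \le \inf_Q \int K(p_{\mathcal{S}_{[n]}|\eta_1}, p_{\mathcal{S}_{[n]}|\eta_2})\, dQ(\eta_1, \eta_2)$.

Note that $K(P_{\mathcal{S}_{[n]}|\eta_1}, P_{\mathcal{S}_{[n]}|\eta_2}) = \sum_{\mathcal{S}_{[n]}} n (K(\eta(\mathcal{S}), \eta_2) - K(\eta(\mathcal{S}), \eta_1))\, p_{\mathcal{S}_{[n]}|\eta_1}$, where
the summation is taken over all realizations of $\mathcal{S}_{[n]} \in \{0, \ldots, d\}^n$. For any $\eta(\mathcal{S}) \in \Delta^d$,
$\eta_1 \in G$ and $\eta_2 \in G'$,

$$|K(\eta(\mathcal{S}), \eta_1) - K(\eta(\mathcal{S}), \eta_2)| = \Big|\sum_{i=0}^d \eta(\mathcal{S})_i \log(\eta_{1,i}/\eta_{2,i})\Big|$$
$$\le \sum_i \eta(\mathcal{S})_i\, |\eta_{1,i} - \eta_{2,i}|/c_0$$
$$\le \Big(\sum_i \eta(\mathcal{S})_i^2\Big)^{1/2} \|\eta_1 - \eta_2\|/c_0$$
$$\le \|\eta_1 - \eta_2\|/c_0.$$

Here, the first inequality is due to the assumption, the second due to Cauchy-Schwarz. It
follows that $K(P_{\mathcal{S}_{[n]}|\eta_1}, P_{\mathcal{S}_{[n]}|\eta_2}) \le n\|\eta_1 - \eta_2\|/c_0$, so $K(p_{\mathcal{S}_{[n]}|G}, p_{\mathcal{S}_{[n]}|G'}) \le \frac{n}{c_0} W_1(p_{\eta|G}, p_{\eta|G'})$.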

□

Lemma 7. Let $G = \mathrm{conv}(\theta_1, \ldots, \theta_k)$ and $G' = \mathrm{conv}(\theta'_1, \ldots, \theta'_k)$ (same $k$). A random
variable $\eta \sim P_{\eta|G}$ is parameterized by $\eta = \sum_j \beta_j\theta_j$, while a random variable $\eta' \sim P_{\eta|G'}$
is parameterized by $\eta' = \sum_j \beta'_j\theta'_j$, where $\beta$ and $\beta'$ are both distributed according to a
symmetric probability density $p_\beta$.

(a) Assume that both $G, G'$ satisfy property A2. Then, for small $d_H(G, G')$,
$W_1(p_{\eta|G}, p_{\eta|G'}) \le C_0\, d_H(G, G')$ for some constant $C_0$ specified by Lemma 1.

(b) Assume further that the assumptions in Lemma 6 hold. Then $K(p_{\mathcal{S}_{[n]}|G}, p_{\mathcal{S}_{[n]}|G'}) \le
\frac{n}{c_0} C_0\, d_H(G, G')$.

Remark. In order to obtain an upper bound for $K(p_{\mathcal{S}_{[n]}|G}, p_{\mathcal{S}_{[n]}|G'})$ in terms of $d_H(G, G')$,
the assumption that $p_\beta$ is symmetric appears essential. That is, the random variables $\beta_1, \ldots, \beta_k$
are exchangeable under $p_\beta$. Without this assumption, it is possible to have $d_H(G, G') = 0$,
but $K(p_{\mathcal{S}_{[n]}|G}, p_{\mathcal{S}_{[n]}|G'}) > 0$.

Proof. By Lemma 1, under property A2, $d_M(G, G') \le C_0\, d_H(G, G')$ for some constant
$C_0$. Let $d_H(G, G') \le \epsilon$ for some small $\epsilon > 0$. Assume without loss of generality that
$\|\theta_j - \theta'_j\| \le C_0\epsilon$ for all $j = 1, \ldots, k$ (otherwise, simply relabel the subscripts for the $\theta'_j$'s).

Let $Q(\eta, \eta')$ be a coupling of $P_{\eta|G}$ and $P_{\eta|G'}$ such that under $Q$, $\eta = \sum_{j=1}^k \beta_j\theta_j$ and
$\eta' = \sum_{j=1}^k \beta_j\theta'_j$, i.e., $\eta$ and $\eta'$ share the same $\beta$, where $\beta$ is a random variable with density
$p_\beta$. This is a valid coupling, since $p_\beta$ is assumed to be symmetric.

Under distribution $Q$, $E\|\eta - \eta'\| \le E \sum_{j=1}^k \beta_j \|\theta_j - \theta'_j\| \le C_0\epsilon\, E \sum_{j=1}^k \beta_j = C_0\epsilon$.
Hence $W_1(P_{\eta|G}, P_{\eta|G'}) \le C_0\epsilon$. Part (b) is an immediate consequence. □
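
The shared-$\beta$ coupling in this proof is easy to simulate. The sketch below draws $\beta$ from a symmetric Dirichlet distribution (one possible symmetric $p_\beta$, chosen here as an assumption), forms $\eta$ and $\eta'$ from the same draw, and averages $\|\eta - \eta'\|$. The average is a Monte Carlo estimate of the cost of this particular coupling, hence of an upper bound on $W_1$; it never exceeds the largest vertex displacement. Function names and vertex values are hypothetical.

```python
import math
import random

def mix(beta, thetas):
    """Convex combination sum_j beta_j * theta_j."""
    return [sum(b * t[i] for b, t in zip(beta, thetas)) for i in range(len(thetas[0]))]

def shared_beta_coupling_cost(thetas, thetas_prime, gamma, draws=2000, seed=0):
    """Return (average ||eta - eta'|| under the shared-beta coupling, max_j ||theta_j - theta'_j||)."""
    rng = random.Random(seed)
    max_vertex_gap = max(math.dist(t, tp) for t, tp in zip(thetas, thetas_prime))
    total = 0.0
    for _ in range(draws):
        g = [rng.gammavariate(a, 1.0) for a in gamma]
        beta = [x / sum(g) for x in g]  # Dirichlet(gamma) draw, shared by both points
        total += math.dist(mix(beta, thetas), mix(beta, thetas_prime))
    return total / draws, max_vertex_gap

thetas = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.2, 0.2, 0.6]]
thetas_prime = [[0.68, 0.22, 0.1], [0.1, 0.79, 0.11], [0.21, 0.2, 0.59]]
cost, gap = shared_beta_coupling_cost(thetas, thetas_prime, gamma=[1.0, 1.0, 1.0])
print(cost, gap)
```

Multiplying the estimated cost by $n/c_0$ then gives a numerical handle on the KL bound of Lemma 6 for these particular vertices.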

Recall the definition of the Kullback-Leibler neighborhood given by Eq. (6). We are now
ready to prove the main result of this section:

Theorem 6. Under Assumptions (S1) and (S2), for any $G_0$ in the support of prior $\Pi$, for
any $\delta > 0$ and $n > \log(1/\delta)$,

$$\Pi(G \in B_K(G_0, \delta)) \ge c\, (\delta^2/n^3)^{kd},$$

where the constant $c = c(c_0, c'_0)$ depends only on $c_0, c'_0$.



Proof. We shall invoke a bound of [Wong and Shen, 1995] (Theorem 5) on the KL
divergence. This bound says that if $p$ and $q$ are two densities on a common space such that
$\int p^2/q < M$, then for some universal constant $\epsilon_0 > 0$, as long as $h(p, q) \le \epsilon < \epsilon_0$, there
holds: $K(p, q) = O(\epsilon^2 \log(M/\epsilon))$, and $K_2(p, q) := \int p(\log(p/q))^2 = O(\epsilon^2 [\log(M/\epsilon)]^2)$,
where the big O constants are universal.

Let $G_0 = \mathrm{conv}(\theta^*_1, \ldots, \theta^*_k)$. Consider a random set $G \in \mathcal{G}^k$ represented by $G =
\mathrm{conv}(\theta_1, \ldots, \theta_k)$, and the event $\mathcal{E}$ that $\|\theta_j - \theta^*_j\| \le \epsilon$ for all $j = 1, \ldots, k$. For the
pair of $G_0$ and $G$, consider a coupling $Q$ for $P_{\eta|G}$ and $P_{\eta|G_0}$ such that any $(\eta_1, \eta_2)$
distributed by $Q$ is parameterized by $\eta_1 = \beta_1\theta_1 + \ldots + \beta_k\theta_k$ and $\eta_2 = \beta_1\theta^*_1 + \ldots + \beta_k\theta^*_k$
(that is, under the coupling $\eta_1$ and $\eta_2$ share the same vector $\beta$). Then, under $Q$,
$E\|\eta_1 - \eta_2\| \le \epsilon$. This entails that $W_1(P_{\eta|G}, P_{\eta|G_0}) \le \epsilon$. (We note here that the argument
appears similar to the one from Lemma 7, but we do not need to assume that $p_\beta$ be
symmetric in this theorem.) If $G$ is randomly distributed according to prior $\Pi$, under
assumption (S2), the probability of event $\mathcal{E}$ is lower bounded by $c'_0 \epsilon^{kd}$. By Lemma 6,
$h^2(p_{G_0}, p_G) \le K(p_{G_0}, p_G)/2 \le (n/(2c_0))\, W_1(P_{\eta|G}, P_{\eta|G_0}) \le n\epsilon/(2c_0)$. Note that the
density ratio $p_{\mathcal{S}_{[n]}|G_0}/p_{\mathcal{S}_{[n]}|G} \le (1/c_0)^n$, which implies that $\sum_{\mathcal{S}_{[n]}} p^2_{\mathcal{S}_{[n]}|G_0}/p_{\mathcal{S}_{[n]}|G} \le (1/c_0)^n$.
We can apply the upper bound described in the previous paragraph to obtain:

$$K(p_{\mathcal{S}_{[n]}|G_0}, p_{\mathcal{S}_{[n]}|G}) = O\Big(\frac{n\epsilon}{2c_0} \Big[\frac{1}{2} \log \frac{2c_0}{n\epsilon} + n \log \frac{1}{c_0}\Big]\Big),$$

$$K_2(p_{\mathcal{S}_{[n]}|G_0}, p_{\mathcal{S}_{[n]}|G}) = O\Big(\frac{n\epsilon}{2c_0} \Big[\frac{1}{2} \log \frac{2c_0}{n\epsilon} + n \log \frac{1}{c_0}\Big]^2\Big).$$

Here, the big O constant is universal. If we set $\epsilon = \delta^2/n^3$, then the quantity on the right
hand side of the previous display is bounded by $O(\delta^2)$ as long as $n > \log(1/\delta)$. Combining
with the probability bound $c'_0 \epsilon^{kd}$ derived above, we obtain the desired result. □



7 Proofs of main theorems and auxiliary lemmas 

Proof of Theorem 1 (Overfitted setting).

Proof. The proof proceeds by verifying the conditions of Theorem 4. Let $\epsilon_{m,n} = (\log m/m)^{1/2} +
(\log n/m)^{1/2} + (\log n/n)^{1/2}$. Choose the sequence of subsets $\mathcal{G}_m$ simply to be the support
of prior $\Pi$, so that $\Pi(\mathcal{G}^* \setminus \mathcal{G}_m) = 0$. Note that $\mathcal{G}_m \subset \mathcal{G}^k$. Condition (8) trivially holds.
Turning to the entropy conditions, we note that

$$\log D(\epsilon/2, \mathcal{G}_m \cap B_{d_H}(G_0, 2\epsilon), d_H) \le \log N(\epsilon/4, \mathcal{G}_m \cap B_{d_H}(G_0, 2\epsilon), d_H) = O(1).$$

By Theorem 5(a), assumption (S4) and the general inequality that $h \ge V$, we have:
$\Psi_{\mathcal{G}_m,n}(\epsilon) \ge [c_1(\epsilon/2)^{p+\alpha} - 6(d+1) e^{-n\epsilon^2/32(d+1)}]^2$, where $p$ is defined as $p = \min(k-1, d)$.
So $\Psi_{\mathcal{G}_m,n}(\epsilon) \ge c\epsilon^{2(p+\alpha)}$ as long as $c_1(\epsilon/2)^{p+\alpha} \ge 12(d+1) \exp[-n\epsilon^2/32(d+1)]$. Here,
$c$ is a constant depending on $c_1, p, d$. This is satisfied if $\epsilon$ is bounded from below by a large
multiple of $\epsilon_{m,n} \ge (\log n/n)^{1/2}$. Using $\Phi_{\mathcal{G},n}(\delta) := \frac{c_0}{4nC_0} \Psi_{\mathcal{G},n}(\delta)$, it follows that

$$\log D(c_0 \Psi_{\mathcal{G}_m,n}(\epsilon)/(4nC_0), \mathcal{G}_m \cap B_{d_H}(G_1, \epsilon/2), d_H)$$
$$\le \log N(c_0 c \epsilon^{2(p+\alpha)}/(4nC_0), \mathcal{G}_m \cap B_{d_H}(G_1, \epsilon/2), d_H)$$
$$\le \log(n^{kd} \epsilon^{-(2p+2\alpha-1)kd}) \le m\epsilon_{m,n}^2,$$

where the last inequality holds since $\epsilon$ is bounded from below by a large multiple of $\epsilon_{m,n} \ge
(\log n/m)^{1/2} + (\log m/m)^{1/2}$. Thus, the entropy condition (7) is established.
To verify condition Eq. (11), we note that for some constant $c > 0$,

$$\exp(2m\epsilon_{m,n}^2) \sum_{j \ge M_m} \exp[-m\Psi_{\mathcal{G}_m,n}(j\epsilon_{m,n})/8]$$
$$\le \exp(2m\epsilon_{m,n}^2) \sum_{j \ge M_m} \exp[-cm(j\epsilon_{m,n})^{2(p+\alpha)}/8]$$
$$\lesssim \exp(2m\epsilon_{m,n}^2) \exp[-cm(M_m\epsilon_{m,n})^{2(p+\alpha)}/8],$$

where the right side of the above display vanishes if $(M_m\epsilon_{m,n})^{p+\alpha}$ is a sufficiently large
multiple of $\epsilon_{m,n}$. This holds if we choose $M_m = M\epsilon_{m,n}^{-(p+\alpha-1)/(p+\alpha)}$ for a large constant $M$.
Eq. (10) also holds.

It remains to verify Eq. (9). By Theorem 6, as long as $n > \log(1/\epsilon_{m,n})$,

$$\log \Pi(G \in B_K(G_0, \epsilon_{m,n})) \ge c(c_0) \log(\epsilon_{m,n}^2/n^3)^{kd} = c(c_0)\, kd\, (2 \log \epsilon_{m,n} - 3 \log n).$$

Eq. (9) holds for a sufficiently large constant $C$ because $\epsilon_{m,n} \ge (\log m/m)^{1/2} + (\log n/m)^{1/2}$,
and the constraint that $n > \log m$.

Now, we can apply Theorem 4 to obtain a posterior contraction rate $M_m\epsilon_{m,n} \asymp \epsilon_{m,n}^{1/(p+\alpha)}$.

□ 
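
To get a feel for the rate just derived, the following sketch evaluates $\epsilon_{m,n}$ and the resulting contraction rate $\epsilon_{m,n}^{1/(p+\alpha)}$ with $p = \min(k-1, d)$. Multiplicative constants are ignored, and the function names are hypothetical.

```python
import math

def epsilon_mn(m, n):
    """(log m / m)^{1/2} + (log n / m)^{1/2} + (log n / n)^{1/2}."""
    return (math.sqrt(math.log(m) / m)
            + math.sqrt(math.log(n) / m)
            + math.sqrt(math.log(n) / n))

def contraction_rate(m, n, k, d, alpha):
    """Rate epsilon_{m,n}^{1/(p + alpha)} with p = min(k - 1, d), constants dropped."""
    p = min(k - 1, d)
    return epsilon_mn(m, n) ** (1.0 / (p + alpha))

for size in (10 ** 2, 10 ** 4, 10 ** 6):
    print(size, epsilon_mn(size, size), contraction_rate(size, size, k=3, d=5, alpha=1.0))
```

The output shows how slowly the rate improves once the exponent $1/(p + \alpha)$ is small, which is the price of the overfitted setting.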

Proof of Theorem 2. The proof proceeds in exactly the same way as Theorem 1, except
that part (b) of Theorem 5 is applied instead of part (a). Accordingly $p$ is replaced by 1 in
the rate exponent.



Proof of Theorem 3 (Minimax lower bounds). (a) The proof involves the construction
of a pair of polytopes in $\mathcal{G}^k$ whose set difference has small volume for a given Hausdorff
distance. We consider two separate cases: (i) $k/2 \le d$ and (ii) $k > 2d$.

If $k/2 \le d$, consider a $q = \lfloor k/2 \rfloor$-simplex $G_0$ that is spanned by $q+1$ vertices in
general positions. Take a vertex of $G_0$, say $\theta_0$. Construct $G'_0$ by chopping $G_0$ off by an
$\epsilon$-cap that is obtained by the convex hull of $\theta_0$ and $q$ other points which lie on the edges
adjacent to $\theta_0$, and at distance $\epsilon$ from $\theta_0$. Clearly, $G'_0$ has $2q \le k$ vertices, so both $G_0$ and
$G'_0$ are in $\mathcal{G}^k$. We have $d_H(G_0, G'_0) \asymp \epsilon$, and $\mathrm{vol}_q(G_0 \setminus G'_0) \asymp \epsilon^q$. Due to Assumption
(S5), $V(p_{\eta|G_0}, p_{\eta|G'_0}) \lesssim \epsilon^{q+\alpha}$. We note here and for the rest of the proof, the multiplying
constants in asymptotic inequalities depend only on $r, R, \delta$ of properties A1 and A2.

If $k > 2d$, consider a $d$-dimensional polytope $G_0$ which has $k-d+1$ vertices in
general positions. Construct $G'_0$ in the same way as above (by chopping $G_0$ off by an $\epsilon$-cap
that contains a vertex $\theta_0$ which has $d$ adjacent vertices). Then, $G'_0$ has $(k-d+1) - 1 + d = k$
vertices. Thus, both $G'_0$ and $G_0$ are in $\mathcal{G}^k$. We have $d_H(G_0, G'_0) \asymp \epsilon$, and $\mathrm{vol}_d(G_0 \setminus G'_0) \asymp
\epsilon^d$. Due to Assumption (S5), $V(p_{\eta|G_0}, p_{\eta|G'_0}) \lesssim \epsilon^{d+\alpha}$.

To combine the two cases, let $q = \min(\lfloor k/2 \rfloor, d)$. We have constructed a pair of
$G_0, G'_0 \in \mathcal{G}^k$ such that $d_H(G_0, G'_0) \asymp \epsilon$, and $V(p_{\eta|G_0}, p_{\eta|G'_0}) \lesssim \epsilon^{q+\alpha}$. By Lemma 6,
$K(P_{\mathcal{S}_{[n]}|G_0}, P_{\mathcal{S}_{[n]}|G'_0}) \lesssim n W_1(p_{\eta|G_0}, p_{\eta|G'_0}) \lesssim n V(p_{\eta|G_0}, p_{\eta|G'_0}) \le Cn\epsilon^{q+\alpha}$ for some
constant $C > 0$ independent of $\epsilon$ and $n$. Note that the second inequality in the above display is
due to Theorem 6.15 of [Villani, 2008].



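Applying the method due to Le Cam (cf. Yu [1997], Lemma 1), for any estimator $\hat{G}$,

$$\max_{G \in \{G_0, G'_0\}} P^m_{\mathcal{S}_{[n]}|G}\, d_H(G, \hat{G}) \gtrsim \epsilon \Big(1 - \frac{1}{2} V(P^m_{\mathcal{S}_{[n]}|G_0}, P^m_{\mathcal{S}_{[n]}|G'_0})\Big).$$

Here, $P^m_{\mathcal{S}_{[n]}|G_0}$ denotes the (product) distribution of the $m$-sample $\mathcal{S}^1_{[n]}, \ldots, \mathcal{S}^m_{[n]}$. Thus,

$$V^2(P^m_{\mathcal{S}_{[n]}|G_0}, P^m_{\mathcal{S}_{[n]}|G'_0}) \le h^2(P^m_{\mathcal{S}_{[n]}|G_0}, P^m_{\mathcal{S}_{[n]}|G'_0})$$
$$= 1 - \int [p^m_{\mathcal{S}_{[n]}|G_0}\, p^m_{\mathcal{S}_{[n]}|G'_0}]^{1/2}$$
$$= 1 - [1 - h^2(p_{\mathcal{S}_{[n]}|G_0}, p_{\mathcal{S}_{[n]}|G'_0})]^m$$
$$\le 1 - (1 - Cn\epsilon^{q+\alpha})^m.$$

The last inequality is due to $h^2(p_{\mathcal{S}_{[n]}|G_0}, p_{\mathcal{S}_{[n]}|G'_0}) \le K(p_{\mathcal{S}_{[n]}|G_0}, p_{\mathcal{S}_{[n]}|G'_0}) \le Cn\epsilon^{q+\alpha}$.
Thus,

$$\max_{G \in \{G_0, G'_0\}} P^m_{\mathcal{S}_{[n]}|G}\, d_H(G, \hat{G}) \gtrsim \epsilon \Big(1 - \frac{1}{2} [1 - (1 - Cn\epsilon^{q+\alpha})^m]^{1/2}\Big).$$

Letting $\epsilon^{q+\alpha} = \frac{1}{Cmn}$, the right side of the previous display is bounded from below by

$$\epsilon \Big(1 - \frac{1}{2}(1 - 1/2)^{1/2}\Big) \asymp \Big(\frac{1}{mn}\Big)^{1/(q+\alpha)}.$$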
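
The order of the lower bound in part (a) can be tabulated directly. The sketch below computes $q = \min(\lfloor k/2 \rfloor, d)$ and the value $(1/(mn))^{1/(q+\alpha)}$, taking the unspecified constant $C$ equal to 1 purely for illustration; the function name is hypothetical.

```python
def minimax_lower_rate(m, n, k, d, alpha):
    """Order of the minimax lower bound, (1 / (m n))^{1 / (q + alpha)}, constants dropped."""
    q = min(k // 2, d)  # q = min(floor(k / 2), d)
    return (1.0 / (m * n)) ** (1.0 / (q + alpha))

for k in (2, 4, 8):
    print(k, minimax_lower_rate(m=1000, n=100, k=k, d=3, alpha=1.0))
```

For fixed $m$ and $n$ the bound degrades as $k$ grows, until $q$ saturates at $d$.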

(b) We employ the same construction of $G_0$ and $G'_0$ as in part (a). Using the argument
used in the proof of Lemma 5, $K(p_{\mathcal{S}_{[n]}|G'_0}, p_{\mathcal{S}_{[n]}|G_0}) = \int \log[\mathrm{vol}_q\, G_0 / \mathrm{vol}_q\, G'_0]\, dP_{\eta|G'_0} \le
\log(1 + C\epsilon^q) \lesssim \epsilon^q$. So, $h^2(p_{\mathcal{S}_{[n]}|G'_0}, p_{\mathcal{S}_{[n]}|G_0}) \le K(p_{\mathcal{S}_{[n]}|G'_0}, p_{\mathcal{S}_{[n]}|G_0}) \lesssim \epsilon^q$. Then,
the proof proceeds as in part (a).

(c) Let $G'_0$ be a polytope such that $|\mathrm{extr}\, G'_0| = |\mathrm{extr}\, G_0| = k$ and $d_H(G'_0, G_0) = \epsilon$.
By Lemma 2, $\mathrm{vol}_p(G_0 \triangle G'_0) = O(\epsilon)$, where $p = (k-1) \wedge d$. The proof proceeds as in
part (a) to obtain the $(1/(mn))^{1/(1+\alpha)}$ rate for the lower bound under assumption (S5). Under
assumption (S5'), as in part (b), the dependence on $n$ can be removed to obtain the $1/m$ rate.

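Proof of $\alpha$-regularity of the Dirichlet-induced densities in Lemma 4

Proof. First, consider the case $k \le d+1$. For $\eta^* \in G$, write $\eta^* = \beta^*_1\theta_1 + \ldots + \beta^*_k\theta_k$.
For $\beta \in \Delta^{k-1}$ such that $|\beta_i - \beta^*_i| \le \epsilon/k$ for all $i = 1, \ldots, k-1$, we have

$$\|\eta - \eta^*\| = \Big\|\sum_{i=1}^k (\beta_i - \beta^*_i)\theta_i\Big\| \le \sum_{i=1}^k |\beta_i - \beta^*_i| \le 2 \sum_{i=1}^{k-1} |\beta_i - \beta^*_i| \le 2\epsilon.$$

Here, we used the fact that $\|\theta_j\| \le 1$ for any $\theta_j \in \Delta^d$. Without loss of generality, assume
that $\beta^*_k \ge 1/k$. Then, for any $\epsilon < 1/k$,

$$P_{\eta|G}(\|\eta - \eta^*\| \le 2\epsilon) \ge P_\beta(|\beta_i - \beta^*_i| \le \epsilon/k;\; i = 1, \ldots, k-1)$$
$$= \frac{\Gamma(\sum_i \gamma_i)}{\prod_{i=1}^k \Gamma(\gamma_i)} \int_{\beta_i \in [0,1];\, |\beta_i - \beta^*_i| \le \epsilon/k;\, i = 1, \ldots, k-1}\; \prod_{i=1}^{k-1} \beta_i^{\gamma_i - 1} \Big(1 - \sum_{i=1}^{k-1} \beta_i\Big)^{\gamma_k - 1} d\beta_1 \ldots d\beta_{k-1}$$
$$\ge \frac{\Gamma(\sum_i \gamma_i)}{\prod_{i=1}^k \Gamma(\gamma_i)} \prod_{i=1}^{k-1} \int_{\max(\beta^*_i - \epsilon/k, 0)}^{\min(\beta^*_i + \epsilon/k, 1)} \beta_i^{\gamma_i - 1}\, d\beta_i \ge \frac{\Gamma(\sum_i \gamma_i)}{\prod_{i=1}^k \Gamma(\gamma_i)} (\epsilon/k)^{k-1}.$$

Both the second and the third inequality in the previous display exploit the fact that since
$\gamma_i \le 1$, $x^{\gamma_i - 1} \ge 1$ for any $x \le 1$.

Now, consider the case $k > d+1$. The proof in the previous case applies, but we can
achieve a better lower bound because the intrinsic dimensionality of $G$ is $d$, not $k-1$. Since
$\eta^* \in \mathrm{conv}(\theta_1, \ldots, \theta_k) \subset \Delta^d$, by Carathéodory's theorem, $\eta^*$ is the convex combination
of $d+1$ or fewer extreme points among the $\theta_j$'s. Without loss of generality, let $\theta_1, \ldots, \theta_{d+1}$
be such points, and write $\eta^* = \beta^*_1\theta_1 + \ldots + \beta^*_{d+1}\theta_{d+1}$. Consider $\eta = \beta_1\theta_1 + \ldots + \beta_k\theta_k$,
where $|\beta_i - \beta^*_i| \le \epsilon/k$, for $i = 1, \ldots, d$, while $0 \le \beta_i \le \epsilon/k$ for $i = d+2, \ldots, k$. Then,
$\|\eta - \eta^*\| \le 2\epsilon$. This implies that

$$P_{\eta|G}(\|\eta - \eta^*\| \le 2\epsilon) \ge P_\beta(|\beta_i - \beta^*_i| \le \epsilon/k,\; i = 1, \ldots, d+1;\; |\beta_j| \le \epsilon/k,\; j > d+1)$$
$$\ge \frac{\Gamma(\sum_i \gamma_i)}{\prod_{i=1}^k \Gamma(\gamma_i)} \prod_{i=1}^{d} \int_{\max(\beta^*_i - \epsilon/k, 0)}^{\min(\beta^*_i + \epsilon/k, 1)} \beta_i^{\gamma_i - 1}\, d\beta_i \times \prod_{i=d+2}^{k} \int_0^{\epsilon/k} \beta_i^{\gamma_i - 1}\, d\beta_i.$$

This concludes the proof. □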



8 Appendix A: Proofs of geometric lemmas 

Proof of Lemma [J 

Proof, (a) Let G = conv(0i, . . . , Ok) and G' = conv(<?i, . . . , Q'y). This part of the lemma 
is immediate from the definition by noting that for any x G G, d(x, G') < minj \\x — 0j\\, 
while the maximum of d(x, G') is attained at some extreme point of G. 

(b) Let dfi(G, G') = e for some small e > 0. Take an extreme point of G, say 0\. 
Due to A2, there is a ray emanating from 0\ that intersects with the interior of G and the 
angles formed by the ray and all (exposed) edges incident to 0\ are bounded from above by 
7r/2 — 5. Let x be the intersection between the ray and the boundary of B p (0\, e). 

Let H be ap — 1-dimensional hyperplane in W that touches (intersects with) B p {Q\,e) 
at only x. Define C(x), resp. C e (x), to be the p-dimensional caps obtained by the inter- 
section between G, resp. G with the half-space which contains 0\ and which is supported 
by H. For any x' that lies in the intersection of H and a line segment [0i, Oi], where Oi is 
another vertex of G, the line segment [x, x'] G H and \\x — x'\\ < ecot 5. Suppose that the 
ray emanating from x through x' intersects with bdG e at x". Then, \\x' — x"\\ < e/ sin 5, 
which implies that \\x — x"\\ < e(cot (5 + 1/ sin 5) by triangle inequality. This entails that 
DiamC e (x) < Ce, where C = (1 + (cot <5 + l/sinJ) 2 ) 1 / 2 . 

Now, du(G, G') = e implies that G' n B p {0\, e) / 0. There is an extreme point of 
G' in the half-space which contains B{0\,e) and is supported by H. But G' C G e , so 
there is an extreme point of G' in C e (x). Hence, there is 0'j G G' such that \\0j — 0i|| < 
Diam(C e (x)) < Ce. Repeat this argument for all other extreme points of G to conclude 
that d M {G,G r ) < Ce. □ 
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Proof of Lemma |2] 



Proof, (a) Let d%{G,G') = e. There exists either a point i £ bdG such that G' n 
B p (x,e/2) = 0, or a point x' G bdG' such that G n B p (x,e/2) = 0. Without loss of 
generality, assume the former. Thus, vol p G AG' > vol p B p (x, e/2)nG, Consider the con- 
vex cone emanating from x that circumscribes the p-dimensional spherical ball B p (0 c ,r) 
(whose existence is given by Condition Al). Since \\x — G c \\ < R, the angle between 
the line segment [x, 6 C ] and the cone's rays is bounded from below by sin ip > r/R. So, 
volp Bd(x,e/2) flG> c\t v , where c\ depends only on r, R,p. 

(b) Let $d_{\mathcal{H}}(G, G') = \epsilon$. Then $G' \subset G_\epsilon$ and $G \subset G'_\epsilon$. Take any point $x \in \mathrm{bd}\, G$, and let $x'$
be the intersection between $\mathrm{bd}\, G_\epsilon$ and the ray emanating from $\theta_c$ and passing through $x$.
Let $H_1$ be a $(p-1)$-dimensional supporting hyperplane for $G$ at $x$. There is also a supporting
hyperplane $H_2$ of $G'$ that is parallel to $H_1$ and at most $\epsilon$ away from $H_1$. Since
$\|\theta_c - x\| \le R$, while the distance from $\theta_c$ to $H_1$ is lower bounded by $r$, the angle $\varphi$
between the vector $\theta_c - x$ and the vector normal to $H_1$ satisfies $\cos\varphi \ge r/R$. This implies
that $\|x' - x\| \le \epsilon/\cos\varphi \le \epsilon R/r$, so $\|x' - \theta_c\|/\|x - \theta_c\| \le 1 + \epsilon R/r^2$. In other words,
$G_\epsilon - \theta_c \subset (1 + \epsilon R/r^2)(G - \theta_c)$. So,
$\mathrm{vol}_p\, G' \setminus G \le \mathrm{vol}_p\, G_\epsilon \setminus G \le [(1 + \epsilon R/r^2)^p - 1]\, \mathrm{vol}_p\, G \le C_1 \epsilon$,
where $C_1$ depends only on $r, R, p$. We obtain a similar bound for $\mathrm{vol}_p\, G \setminus G'$, which
concludes the proof. □
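
The volume bound in part (b) can be checked numerically in a case where every quantity is known in closed form. The Python sketch below takes $G$ to be a disc of radius $\rho$ in the plane, so that $p = 2$ and $r = R = \rho$, in which case the bound $[(1 + \epsilon R/r^2)^p - 1]\,\mathrm{vol}_p\, G$ coincides with the exact volume of $G_\epsilon \setminus G$. The function names and the numerical values are illustrative assumptions; the linearized bound uses the elementary inequality $(1+x)^p - 1 \le p\,x\,(1+x)^{p-1}$ for $x \ge 0$, which is one way to see that the bound is of order $\epsilon$.

```python
import math

def fattening_volume_bound(eps, r, R, p, vol_G):
    """Upper bound [(1 + eps*R/r^2)^p - 1] * vol_p(G) on the volume of G_eps minus G."""
    return ((1.0 + eps * R / r**2) ** p - 1.0) * vol_G

def linearized_bound(eps, r, R, p, vol_G):
    """Linear-in-eps relaxation via (1 + x)^p - 1 <= p * x * (1 + x)^(p - 1), x >= 0."""
    x = eps * R / r**2
    return p * x * (1.0 + x) ** (p - 1) * vol_G

# Illustrative check: G is a disc of radius rho, so G_eps is a disc of radius rho + eps.
rho, eps = 1.0, 0.05
vol_G = math.pi * rho**2
exact = math.pi * ((rho + eps) ** 2 - rho**2)      # exact area of the annulus
bound = fattening_volume_bound(eps, rho, rho, 2, vol_G)
linear = linearized_bound(eps, rho, rho, 2, vol_G)
print(exact, bound, linear)
```

For the disc the first two printed values agree, which suggests that the dilation argument cannot be improved in general, while the third value is slightly larger.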



Proof of Lemma 3

Proof. We provide a proof for case (a). Let $G = \mathrm{conv}(\theta_1, \ldots, \theta_k)$ and $G' = \mathrm{conv}(\theta'_1, \ldots, \theta'_k)$,
where $G$ is fixed but $G'$ is allowed to vary. Since $G$ is fixed, it satisfies A1 and A2 for some
constants $r$, $R$ and $\delta$ (depending on $G$). Moreover, there is some $\epsilon_0 = \epsilon_0(G)$ depending
only on $G$ such that as soon as $d_{\mathcal{H}}(G, G') \le \epsilon_0$, $G'$ also satisfies A1 and A2 for constants
$\delta' = \delta/2$, $r' = r/2$, $R' = 2R$.

Suppose that $d_{\mathcal{H}}(G, G') = \epsilon$ such that $\epsilon < \epsilon_0$. By Lemma 1(b), for each vertex of $G$, say
$\theta_i$, there is a vertex of $G'$, say $\theta'_i$, such that $\theta'_i \in B_p(\theta_i, C_0\epsilon)$ with $C_0 = C_0(G)$ depending
only on $\delta$. Moreover, there is at least one vertex of $G$, say $\theta_1$, for which $\|\theta'_1 - \theta_1\| \ge \epsilon$.

There are only three possible general positions for $\theta'_1$ relative to $G$. Either

(i) $\theta'_1 \in G$, or

(ii) $\theta'_1 \in 2\theta_1 - G$, or

(iii) $\theta'_1$ lies in a cone formed by all half-spaces supported by the $(p-1)$-dimensional faces
adjacent to $\theta_1$. Among these there is one half-space that contains $G$, and one that
does not contain $G$.

If (i) is true, by property A1, $G$ has at least one face $S \ni \theta_1$ such that the distance from $\theta'_1$ to
the hyperplane that provides support for $S$ is bounded from below by $\epsilon r/R$. Let $B \subset S$ be
a homothetic transformation of $S$ with respect to center $\theta_1$ that maps $x \in S$ to $\tilde{x} \in B$ such
that the ratio $\eta := \|\theta_1 - \tilde{x}\|/\|\theta_1 - x\|$ satisfies $1 - \eta = 2C_0\epsilon/\min_{i \neq j}\|\theta_i - \theta_j\| \in (0, 1/2)$.
This is possible as soon as $\epsilon < \min_{i \neq j}\|\theta_i - \theta_j\|/4C_0$. Then, for any $\theta_j \in S$, $j \neq 1$, under
this transformation $\theta_j \mapsto \tilde{\theta}_j \in B$, for which $\|\tilde{\theta}_j - \theta_j\| = (1 - \eta)\|\theta_1 - \theta_j\| \ge 2C_0\epsilon$.

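Since $\|\theta'_j - \theta_j\| \le C_0\epsilon$, the construction of $B$ implies that $\theta'_j \notin B$. As a result,
$B \cap G' = \emptyset$. Moreover, $\mathrm{vol}_{p-1} B = \eta^{p-1}\, \mathrm{vol}_{p-1} S \ge (1/2)^{p-1}\, \mathrm{vol}_{p-1} S \ge c_0(G)$, a constant
depending only on $G$. Let $Q$ be a $p$-pyramid which has apex $\theta'_1$ and base $B$. It follows
that $\mathrm{relint}\, Q \cap \mathrm{relint}\, G' = \emptyset$, which implies that $\mathrm{relint}\, Q \subset G \setminus G'$ (relint stands for the
relative interior of a set). Hence,
$\mathrm{vol}_p\, G \setminus G' \ge \mathrm{vol}_p\, Q \ge \frac{1}{p}\, \frac{\epsilon r}{R}\, \mathrm{vol}_{p-1} B \ge \frac{1}{p}\, \epsilon\, c_0(G)\, r/R$.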

If (ii) is true, the same argument can be applied to show that $\mathrm{vol}_p(G' \setminus G) = \Omega(\epsilon)$.
If (iii) is true, a similar argument continues to apply: we obtain a lower bound for either
$\mathrm{vol}_p\, G' \setminus G$ or $\mathrm{vol}_p\, G \setminus G'$. $G$ has a face (supported by a hyperplane, say, $H$) such that the
distance from $\theta'_1$ to $H$ is $\Omega(\epsilon)$. If the half-space supported by $H$ that contains $\theta'_1$ does
not contain $G$, then $\mathrm{vol}_p\, G' \setminus G = \Omega(\epsilon)$. If, on the other hand, the associated half-space does
contain $G$, then $\mathrm{vol}_p\, G \setminus G' = \Omega(\epsilon)$. The proof for case (b) is similar and is omitted. □
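
The lower bound in case (i) rests on the volume formula for a $p$-pyramid, namely height times base volume divided by $p$. The following Python sketch checks this formula in dimension $p = 3$ against a direct determinant computation for a tetrahedron. The coordinates are arbitrary illustrative values and the function names are hypothetical.

```python
def pyramid_volume(p, height, base_volume):
    """Volume of a p-dimensional pyramid: height * vol_{p-1}(base) / p."""
    return height * base_volume / p

def tetrahedron_volume(a, b, c, apex):
    """Volume of a tetrahedron via |det[b - a, c - a, apex - a]| / 6."""
    u = [b[i] - a[i] for i in range(3)]
    v = [c[i] - a[i] for i in range(3)]
    w = [apex[i] - a[i] for i in range(3)]
    det = (u[0] * (v[1] * w[2] - v[2] * w[1])
           - u[1] * (v[0] * w[2] - v[2] * w[0])
           + u[2] * (v[0] * w[1] - v[1] * w[0]))
    return abs(det) / 6.0

# Base triangle in the plane z = 0 (area 1/2); apex at height 0.6 above that plane.
a, b, c = (0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)
apex = (0.2, 0.3, 0.6)
vol_formula = pyramid_volume(3, 0.6, 0.5)
vol_direct = tetrahedron_volume(a, b, c, apex)
print(vol_formula, vol_direct)
```

In the proof the height is at least $\epsilon r/R$ and the base volume is at least $c_0(G)$, so the same formula yields a volume that grows at least linearly in $\epsilon$.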



9 Appendix B: Proof of abstract posterior contraction theorem 

A key ingredient in the general analysis of convergence of posterior distributions is
establishing the existence of tests for subsets of parameters of interest. A test $\varphi_{m,n}$ is
a measurable indicator function of the $m \times n$-sample $S_{[n]}^{[m]} = (S_{[n]}^{1}, \ldots, S_{[n]}^{m})$ from an
admixture model. For a fixed pair of convex polytopes $G_0, G_1 \in \mathcal{G}$, where $\mathcal{G}$ is a given
subset of $\Delta^d$, consider tests for discriminating $G_0$ against a closed Hausdorff ball centered
at $G_1$. The following two lemmas on the existence of tests highlight the fundamental role
of the Hellinger information:

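Lemma 8. Fix a pair $(G_0, G_1) \in (\mathcal{G}^* \times \mathcal{G})$ and let $\delta = d_{\mathcal{H}}(G_0, G_1)$. Then, there exist
tests $\{\varphi_{m,n}\}$ that have the following properties:

$$P^m_{S_{[n]}|G_0}\, \varphi_{m,n} \le D \exp[-m\Psi_{\mathcal{G},n}(\delta)/8], \tag{16}$$

$$\sup_{G \in \mathcal{G} \cap B_{d_{\mathcal{H}}}(G_1, \delta/2)} P^m_{S_{[n]}|G}\, (1 - \varphi_{m,n}) \le \exp[-m\Psi_{\mathcal{G},n}(\delta)/8]. \tag{17}$$

Here, $D := D(\Psi_{\mathcal{G},n}(\delta), \mathcal{G} \cap B_{d_{\mathcal{H}}}(G_1, \delta/2), d_{\mathcal{H}})$, i.e., the maximal number of elements in
$\mathcal{G} \cap B_{d_{\mathcal{H}}}(G_1, \delta/2)$ that are mutually separated by at least $\Psi_{\mathcal{G},n}(\delta)$ in Hausdorff metric $d_{\mathcal{H}}$.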
Proof. We begin the proof by noting that a direct application of standard results on the
existence of tests (cf. Le Cam [1986], Chapter 4) is not possible, due to the lack of convexity of
the space of densities of $S_{[n]}$ as $G$ varies in some subset $\mathcal{G} \subset \mathcal{G}^*$, even if $\mathcal{G}$ is convex. This
difficulty is overcome by appealing to a packing argument.

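Consider a maximal $\Psi_{\mathcal{G},n}(\delta)$-packing in $d_{\mathcal{H}}$ metric for the set $\mathcal{G} \cap B_{d_{\mathcal{H}}}(G_1, \delta/2)$. This
yields a set of $D = D(\Psi_{\mathcal{G},n}(\delta), \mathcal{G} \cap B_{d_{\mathcal{H}}}(G_1, \delta/2), d_{\mathcal{H}})$ elements
$\tilde{G}_1, \ldots, \tilde{G}_D \in \mathcal{G} \cap B_{d_{\mathcal{H}}}(G_1, \delta/2)$.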

Next, we note the following fact: for any t = 1, . . . , D, ifGG^n B dn (Gi, 5/2) and 
dn(G,G t ) < ®g jn (5), then by the definition of h 2 (p s \ G ,p s ,q 
the definition of Hellinger information, h 2 {ps [n] \G > Ps^ ]\gJ — ^G,n(^)- Thus, by triangle

5[n]|Go'P'S[ n ] 



inequality, h(p S[n] \G ,PS [n] \G) > ^G,n( 6 ) 1/2 - 



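For each pair of $G_0$, $\tilde{G}_t$ there exist tests $\omega^{(t)}_{m,n}$ of $p_{S_{[n]}|G_0}$ versus the Hellinger ball
$\mathcal{P}_2(t) := \{p_{S_{[n]}|G} \mid G \in \mathcal{G}^*;\ h(p_{S_{[n]}|G}, p_{S_{[n]}|\tilde{G}_t}) \le \tfrac{1}{2} h(p_{S_{[n]}|G_0}, p_{S_{[n]}|\tilde{G}_t})\}$ such that

$$P^m_{S_{[n]}|G_0}\, \omega^{(t)}_{m,n} \le \exp[-m h^2(p_{S_{[n]}|G_0}, p_{S_{[n]}|\tilde{G}_t})/8],$$

$$\sup_{P_2 \in \mathcal{P}_2(t)} P_2^m\, (1 - \omega^{(t)}_{m,n}) \le \exp[-m h^2(p_{S_{[n]}|G_0}, p_{S_{[n]}|\tilde{G}_t})/8].$$

Consider the test $\varphi_{m,n} = \max_{1 \le t \le D} \omega^{(t)}_{m,n}$, then

$$P^m_{S_{[n]}|G_0}\, \varphi_{m,n} \le D \times \exp[-m\Psi_{\mathcal{G},n}(\delta)/8],$$

$$\sup_{G \in \mathcal{G} \cap B_{d_{\mathcal{H}}}(G_1, \delta/2)} P^m_{S_{[n]}|G}\, (1 - \varphi_{m,n}) \le \exp[-m\Psi_{\mathcal{G},n}(\delta)/8].$$

The first inequality is due to $\varphi_{m,n} \le \sum_{t=1}^{D} \omega^{(t)}_{m,n}$, and the second is due to the fact that for
any $G \in \mathcal{G} \cap B_{d_{\mathcal{H}}}(G_1, \delta/2)$ there is some $t = 1, \ldots, D$ such that $d_{\mathcal{H}}(G, \tilde{G}_t) \le \Psi_{\mathcal{G},n}(\delta)$,
so that $p_{S_{[n]}|G} \in \mathcal{P}_2(t)$. □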
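
The packing argument uses two properties of a maximal packing: its elements are mutually separated, and every point of the set lies within the separation radius of some packing element. The Python sketch below illustrates both properties with a greedy construction on a toy metric space, in which points on the real line stand in for polytopes and the absolute difference stands in for the Hausdorff metric. The greedy procedure returns a packing that cannot be extended, whose size is at most the packing number $D$; the function names and the numerical values, including the value used for the Hellinger information, are illustrative assumptions.

```python
import math

def greedy_packing(points, sep, dist):
    """Return a sep-separated subset of `points` that cannot be extended (greedy)."""
    centers = []
    for x in points:
        if all(dist(x, c) >= sep for c in centers):
            centers.append(x)
    return centers

def union_bound_type_one(D, m, psi):
    """Type-I error bound D * exp(-m * psi / 8) for the maximum of D tests."""
    return D * math.exp(-m * psi / 8.0)

# Toy stand-in: a grid on [0, 1] with |x - y| playing the role of the metric.
dist = lambda x, y: abs(x - y)
points = [i / 20 for i in range(21)]
sep = 0.25
centers = greedy_packing(points, sep, dist)

# Maximality implies covering: every point is within `sep` of some center.
covered = all(any(dist(x, c) < sep for c in centers) for x in points)

# Hypothetical values: m = 200 individuals, Hellinger information psi = 0.25.
bound = union_bound_type_one(len(centers), 200, 0.25)
print(centers, covered, bound)
```

The covering property is what allows each alternative in the ball to be assigned to one of the $D$ Hellinger-ball tests, and the factor $D$ in the type-I bound is the price of taking their maximum.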

Next, the existence of tests can be shown for discriminating Go against the complement 
of a closed Hausdorff ball: 

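Lemma 9. Suppose that $\mathcal{G}$ satisfies condition C. Fix $G_0 \in \mathcal{G}^k$. Suppose that for some
non-increasing function $D(\epsilon)$, some $\epsilon_{m,n} > 0$ and every $\epsilon > \epsilon_{m,n}$,

$$\sup_{G_1 \in \mathcal{G}} D(\Psi_{\mathcal{G},n}(\epsilon), \mathcal{G} \cap B_{d_{\mathcal{H}}}(G_1, \epsilon/2), d_{\mathcal{H}}) \times D(\epsilon/2, \mathcal{G} \cap B_{d_{\mathcal{H}}}(G_0, 2\epsilon) \setminus B_{d_{\mathcal{H}}}(G_0, \epsilon), d_{\mathcal{H}}) \le D(\epsilon). \tag{18}$$

Then, for every $\epsilon > \epsilon_{m,n}$, and any $t_0 \in \mathbb{N}$, there exist tests $\varphi_{m,n}$ (depending on $\epsilon > 0$) such
that

$$P_{G_0}\, \varphi_{m,n} \le D(\epsilon) \sum_{t=t_0}^{\lceil \mathrm{Diam}(\mathcal{G})/\epsilon \rceil} \exp[-m\Psi_{\mathcal{G},n}(t\epsilon)/8], \tag{19}$$

$$\sup_{G \in \mathcal{G}:\ d_{\mathcal{H}}(G_0, G) > t_0\epsilon} P_G\, (1 - \varphi_{m,n}) \le \exp[-m\Psi_{\mathcal{G},n}(t_0\epsilon)/8]. \tag{20}$$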

Proof. The proof consists of a standard peeling device (e.g., [Ghosal et al., 2000]) and a
packing argument as in the previous proof. For a given $t \in \mathbb{N}$ choose a maximal $t\epsilon/2$-packing
for the set $S_t = \{G : t\epsilon < d_{\mathcal{H}}(G_0, G) \le (t+1)\epsilon\}$. This yields a set $S'_t$ of at most
$D(t\epsilon/2, S_t, d_{\mathcal{H}})$ points. Moreover, every $G \in S_t$ is within distance $t\epsilon/2$ of at least one of
the points in $S'_t$. For every such point $G_1 \in S'_t$, there exists a test $\omega_{m,n}$ satisfying Eqs.
(16) and (17), where $\delta$ is taken to be $\delta = t\epsilon$. Take $\varphi_{m,n}$ to be the maximum of all tests
attached this way to some point $G_1 \in S'_t$ for some $t \ge t_0$. Note that $G \in \mathcal{G} \subset \Delta^d$, so
$t \le \lceil \mathrm{Diam}(\mathcal{G})/\epsilon \rceil$. Then, by the union bound, and the condition that $D(\epsilon)$ is non-increasing,

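$$P_{G_0}\, \varphi_{m,n} \le \sum_{t=t_0}^{\lceil \mathrm{Diam}(\mathcal{G})/\epsilon \rceil} \sum_{G_1 \in S'_t} D(\Psi_{\mathcal{G},n}(t\epsilon), \mathcal{G} \cap B_{d_{\mathcal{H}}}(G_1, t\epsilon/2), d_{\mathcal{H}}) \exp[-m\Psi_{\mathcal{G},n}(t\epsilon)/8]$$

$$\le D(\epsilon) \sum_{t \ge t_0} \exp[-m\Psi_{\mathcal{G},n}(t\epsilon)/8],$$

and

$$\sup_{G \in \cup_{u \ge t_0} S_u} P_G\, (1 - \varphi_{m,n}) \le \sup_{u \ge t_0} \exp[-m\Psi_{\mathcal{G},n}(u\epsilon)/8] \le \exp[-m\Psi_{\mathcal{G},n}(t_0\epsilon)/8],$$

where the last inequality is due to the monotonicity of $\Psi_{\mathcal{G},n}(\cdot)$.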



□ 
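
To see how the peeling bounds (19) and (20) behave, the following Python sketch evaluates their right-hand sides for a hypothetical Hellinger information function. The choice $\Psi(\delta) = \delta^2$ is purely an assumption made for illustration (the paper's $\Psi_{\mathcal{G},n}$ depends on the model), as are the function names and the numerical values of $m$, $\epsilon$, $t_0$ and the diameter.

```python
import math

def peeling_type_one_bound(D_eps, m, eps, t0, diam, psi):
    """Right-hand side of (19): D(eps) * sum_{t=t0}^{ceil(diam/eps)} exp(-m*psi(t*eps)/8)."""
    t_max = math.ceil(diam / eps)
    return D_eps * sum(math.exp(-m * psi(t * eps) / 8.0) for t in range(t0, t_max + 1))

def type_two_bound(m, eps, t0, psi):
    """Right-hand side of (20): exp(-m * psi(t0 * eps) / 8)."""
    return math.exp(-m * psi(t0 * eps) / 8.0)

# Hypothetical, non-decreasing Hellinger information (an assumption for illustration).
psi = lambda delta: delta ** 2

# Small example: two shells (t = 1, 2) of width eps = 0.5 in a set of diameter 1.
type_one = peeling_type_one_bound(1.0, 8, 0.5, 1, 1.0, psi)
type_two = type_two_bound(8, 0.5, 1, psi)
print(type_one, type_two)
```

Because the assumed $\Psi$ is non-decreasing, the shell $t = t_0$ dominates both bounds, which mirrors the role of monotonicity in the last step of the proof.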



Proof of the abstract posterior contraction theorem (Theorem 4)

Proof. In this proof, to simplify notation, denote $P_G := P_{S_{[n]}|G}$ and so on. By a result of
Ghosal et al. [Ghosal et al., 2000] (Lemma 8.1, pg. 524), for every $\epsilon > 0$, $C > 0$ and every
probability measure $\Pi_0$ supported on the set $B_K(G_0, \epsilon)$ defined by Eq. ©, we have,

$$P_{G_0}\left( \int \prod_{i=1}^{m} \frac{p_G(S^i_{[n]})}{p_{G_0}(S^i_{[n]})}\, d\Pi_0(G) \le \exp(-(1 + C)m\epsilon^2) \right) \le \frac{1}{C^2 m \epsilon^2}.$$

This entails that, by fixing $C = 1$, there is an event $A_m$ with $P_{G_0}$-probability at least
$1 - (m\epsilon^2_{m,n})^{-1}$, for which there holds:

$$\int \prod_{i=1}^{m} \frac{p_G(S^i_{[n]})}{p_{G_0}(S^i_{[n]})}\, d\Pi(G) \ge \exp(-2m\epsilon^2_{m,n})\, \Pi(B_K(G_0, \epsilon_{m,n})). \tag{21}$$



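Let $\mathcal{O}_m = \{G \in \mathcal{G}^* : d_{\mathcal{H}}(G_0, G) \ge M_m\epsilon_{m,n}\}$. Due to Eq. ©, the condition specified by
Lemma 9 is satisfied by setting $D(\epsilon) = \exp(m\epsilon^2_{m,n})$ (constant in $\epsilon$). Thus there exist tests
$\varphi_{m,n}$ for which Eqs. (19) and (20) hold. Then,

$$P_{G_0}\, \Pi(G \in \mathcal{O}_m \mid S^{[m]}_{[n]})$$

$$= P_{G_0}[\varphi_{m,n}\, \Pi(G \in \mathcal{O}_m \mid S^{[m]}_{[n]})] + P_{G_0}[(1 - \varphi_{m,n})\, \Pi(G \in \mathcal{O}_m \mid S^{[m]}_{[n]})]$$

$$\le P_{G_0}[\varphi_{m,n}\, \Pi(G \in \mathcal{O}_m \mid S^{[m]}_{[n]})] + P_{G_0}\, I(A_m^c) + P_{G_0}[(1 - \varphi_{m,n})\, \Pi(G \in \mathcal{O}_m \mid S^{[m]}_{[n]})\, I(A_m)].$$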
Applying Lemma 9, the first term in the preceding display is bounded above by

$$P_{G_0}\, \varphi_{m,n} \le D(\epsilon_{m,n}) \sum_{j \ge M_m} \exp[-m\Psi_{\mathcal{G}_m,n}(j\epsilon_{m,n})/8] \to 0,$$

thanks to Eq. (11). The second term in the above display is bounded by $(m\epsilon^2_{m,n})^{-1}$ by the
definition of $A_m$, so this term vanishes. It remains to show that the third term in the display
also vanishes as $m \to \infty$. By Bayes' rule, we write

$$\Pi(G \in \mathcal{O}_m \mid S^{[m]}_{[n]}) = \frac{\int_{\mathcal{O}_m} \prod_{i=1}^{m} p_G(S^i_{[n]})/p_{G_0}(S^i_{[n]})\, d\Pi(G)}{\int \prod_{i=1}^{m} p_G(S^i_{[n]})/p_{G_0}(S^i_{[n]})\, d\Pi(G)},$$

and then obtain a lower bound for the denominator by Eq. (21). For the numerator, by
Fubini's theorem:

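$$P_{G_0} \int_{\mathcal{O}_m \cap \mathcal{G}_m} (1 - \varphi_{m,n}) \prod_{i=1}^{m} \frac{p_G(S^i_{[n]})}{p_{G_0}(S^i_{[n]})}\, d\Pi(G) = \int_{\mathcal{O}_m \cap \mathcal{G}_m} P_G(1 - \varphi_{m,n})\, d\Pi(G) \le \exp[-m\Psi_{\mathcal{G}_m,n}(M_m\epsilon_{m,n})/8], \tag{22}$$

where the last inequality is due to Eq. (20). In addition, by (8),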

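$$P_{G_0} \int_{\mathcal{O}_m \setminus \mathcal{G}_m} (1 - \varphi_{m,n}) \prod_{i=1}^{m} \frac{p_G(S^i_{[n]})}{p_{G_0}(S^i_{[n]})}\, d\Pi(G) = \int_{\mathcal{O}_m \setminus \mathcal{G}_m} P_G(1 - \varphi_{m,n})\, d\Pi(G) \le \Pi(\mathcal{G}^* \setminus \mathcal{G}_m). \tag{23}$$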

Now, combining bounds (22) and (23) with condition (10), we obtain:

$$P_{G_0}\, (1 - \varphi_{m,n})\, \Pi(G \in \mathcal{O}_m \mid S^{[m]}_{[n]})\, I(A_m) \le \frac{\Pi(\mathcal{G}^* \setminus \mathcal{G}_m) + \exp[-m\Psi_{\mathcal{G}_m,n}(M_m\epsilon_{m,n})/8]}{\exp(-2m\epsilon^2_{m,n})\, \Pi(B_K(G_0, \epsilon_{m,n}))}.$$

The upper bound in the preceding display converges to 0 by Eq. (11), thereby concluding
the proof.

□ 

References 

A. Anandkumar, D. Foster, D. Hsu, S. Kakade, and Y. K. Liu. Two SVDs suffice: 
Spectral decompositions for probabilistic topic modeling and latent Dirichlet allocation. 
arXiv:1204.6703, 2012.






S. Arora, R. Ge, and A. Moitra. Learning topic models - going beyond SVD. 
arXiv:1204.1956, 2012.

A. Barron, M. Schervish, and L. Wasserman. The consistency of posterior distributions in 
nonparametric problems. Ann. Statist., 27:536-561, 1999.

D.M. Blei, A.Y. Ng, and M.I. Jordan. Latent Dirichlet allocation. J. Mach. Learn. Res., 3:
993-1022, 2003.

L. Le Cam. Asymptotic methods in statistical decision theory. Springer-Verlag, 1986.

J. Chen. Optimal rate of convergence for finite mixture models. Annals of Statistics, 23(1): 
221-233, 1995. 

L. Evans and R. Gariepy. Measure theory and fine properties of functions. CRC Press, 
1992. 

S. Ghosal and A. van der Vaart. Convergence rates of posterior distributions for noniid 
observations. Ann. Statist., 35(1):192-223, 2007.

S. Ghosal, J. K. Ghosh, and A. van der Vaart. Convergence rates of posterior distributions. 
Ann. Statist., 28(2):500-531, 2000. 

J. K. Ghosh and R. V. Ramamoorthi. Bayesian nonparametrics. Springer, 2002.

H. Ishwaran, L. James, and J. Sun. Bayesian model selection in finite mixtures by marginal 
density decompositions. Journal of American Statistical Association, 96(456):1316- 
1332, 2001. 

X. Nguyen. Convergence of latent mixing measures in finite and infinite mixture models. 
Annals of Statistics, to appear, 2012. 

J. Pritchard, M. Stephens, and P. Donnelly. Inference of population structure using multilo- 
cus genotype data. Genetics, 155:945-959, 2000. 

J. Rousseau and K. Mengersen. Asymptotic behaviour of the posterior distribution in over- 
fitted mixture models. Journal of the Royal Statistical Society: Series B, 73(5):689-710, 
2011. 

R. Schneider. Convex bodies: Brunn-Minkowski theory. Cambridge University Press, 1993.

X. Shen and L. Wasserman. Rates of convergence of posterior distributions. Ann. Statist., 
29:687-714, 2001. 

W. Toussile and E. Gassiat. Model based clustering using multilocus data with loci selec- 
tion. Advances in Data Analysis and Classification, 3:109-134, 2009. 

S. van de Geer. Empirical processes in M-estimation. Cambridge University Press, 2000. 






Cedric Villani. Optimal transport: Old and New. Springer, 2008. 

S. Walker. New approaches to Bayesian consistency. Ann. Statist., 32(5):2028-2043, 2004.

S. Walker, A. Lijoi, and I. Prünster. On rates of convergence for posterior distributions in
infinite-dimensional models. Ann. Statist., 35(2):738-746, 2007. 

W. H. Wong and X. Shen. Probability inequalities for likelihood ratios and convergence rates of
sieve MLEs. Ann. Statist., 23:339-362, 1995.

B. Yu. Assouad, Fano, and Le Cam. Festschrift for Lucien Le Cam, pages 423-435, 1997.






